1
Su H, Sun Y, Li R, Zhang A, Yang Y, Xiao F, Duan Z, Chen J, Hu Q, Yang T, Xu B, Zhang Q, Zhao J, Li Y, Li H. Large Language Models in Medical Diagnostics: Scoping Review With Bibliometric Analysis. J Med Internet Res 2025;27:e72062. PMID: 40489764. DOI: 10.2196/72062.
Abstract
BACKGROUND The integration of large language models (LLMs) into medical diagnostics has garnered substantial attention due to their potential to enhance diagnostic accuracy, streamline clinical workflows, and address health care disparities. However, the rapid evolution of LLM research necessitates a comprehensive synthesis of their applications, challenges, and future directions. OBJECTIVE This scoping review aimed to provide an overview of the current state of research regarding the use of LLMs in medical diagnostics. The study sought to answer four primary subquestions, as follows: (1) Which LLMs are commonly used? (2) How are LLMs assessed in diagnosis? (3) What is the current performance of LLMs in diagnosing diseases? (4) Which medical domains are investigating the application of LLMs? METHODS This scoping review was conducted according to the Joanna Briggs Institute Manual for Evidence Synthesis and adheres to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). Relevant literature was searched from the Web of Science, PubMed, Embase, IEEE Xplore, and ACM Digital Library databases from 2022 to 2025. Articles were screened and selected based on predefined inclusion and exclusion criteria. Bibliometric analysis was performed using VOSviewer to identify major research clusters and trends. Data extraction included details on LLM types, application domains, and performance metrics. RESULTS The field is rapidly expanding, with a surge in publications after 2023. GPT-4 and its variants dominated research (70/95, 74% of studies), followed by GPT-3.5 (34/95, 36%). Key applications included disease classification (text or image-based), medical question answering, and diagnostic content generation. LLMs demonstrated high accuracy in specialties like radiology, psychiatry, and neurology but exhibited biases in race, gender, and cost predictions. Ethical concerns, including privacy risks and model hallucination, alongside regulatory fragmentation, were critical barriers to clinical adoption. CONCLUSIONS LLMs hold transformative potential for medical diagnostics but require rigorous validation, bias mitigation, and multimodal integration to address real-world complexities. Future research should prioritize explainable artificial intelligence frameworks, specialty-specific optimization, and international regulatory harmonization to ensure equitable and safe clinical deployment.
Affiliation(s)
- Hankun Su
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Xiangya School of Medicine, Central South University, Changsha, China
- Yuanyuan Sun
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Ruiting Li
- School of Biomedical Sciences and Engineering, South China University of Technology, Guangzhou, China
- Aozhe Zhang
- Xiangya School of Medicine, Central South University, Changsha, China
- Yuemeng Yang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Xiangya School of Medicine, Central South University, Changsha, China
- Fen Xiao
- Department of Metabolism and Endocrinology, Second Xiangya Hospital of Central South University, Changsha, China
- Zhiying Duan
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Jingjing Chen
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Qin Hu
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Tianli Yang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Bin Xu
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Qiong Zhang
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Jing Zhao
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Yanping Li
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
- Hui Li
- Department of Reproductive Medicine, Xiangya Hospital Central South University, Changsha, China
- Clinical Research Center for Women's Reproductive Health in Hunan Province, Changsha, China
2
Huang S, Wen C, Bai X, Li S, Wang S, Wang X, Yang D. Exploring the Application Capability of ChatGPT as an Instructor in Skills Education for Dental Medical Students: Randomized Controlled Trial. J Med Internet Res 2025;27:e68538. PMID: 40424023. DOI: 10.2196/68538.
Abstract
BACKGROUND Clinical operative skills training is a critical component of preclinical education for dental students. Although technology-assisted instruction, such as virtual reality and simulators, is increasingly being integrated, direct guidance from instructors remains the cornerstone of skill development. ChatGPT, an advanced conversational artificial intelligence model developed by OpenAI, is gradually being used in medical education. OBJECTIVE This study aimed to compare the effects of ChatGPT-assisted skill learning on performance, cognitive load, self-efficacy, learning motivation, and spatial ability, with the aim of evaluating the potential of ChatGPT in clinical operative skills education. METHODS In this study, 187 undergraduate dental students recruited from a first-class university in China were randomly divided into a ChatGPT group and a blank control group. Among them, the control group used videos for skill acquisition, and the ChatGPT group used ChatGPT in addition to the videos. After 1 week of intervention, skills were tested using desktop virtual reality, and cognitive load was measured by recording changes in pupil diameter with an eye tracker. In addition, a spatial ability test was administered to analyze the effect of ChatGPT on those with different spatial abilities. Finally, a questionnaire was also used to assess cognitive load and self-efficacy during the learning process. RESULTS A total of 192 dental undergraduates from a top-tier Chinese university were initially recruited for the experiment by October 25, 2024. Following eye-tracking calibration procedures, 5 participants were excluded, resulting in 187 eligible students successfully completing the experimental protocol by November 2, 2024. Following a short-term intervention administered through randomized allocation, superior performance (ChatGPT group: mean 73.12, SD 10.06; control group: mean 65.54, SD 12.48; P<.001) was observed among participants in the ChatGPT group, along with higher levels of self-efficacy (P=.04) and learning motivation (P=.02). In addition, cognitive load was lower in the ChatGPT group according to eye-tracking measures (ChatGPT group: mean 0.137, SD 0.036; control group: mean 0.312, SD 0.032; P<.001). The analysis of the learning performance of participants with different spatial abilities in the 2 modalities showed that compared to the learners with high spatial abilities (ChatGPT group: mean 76.58, SD 9.23; control group: mean 73.89, SD 11.75; P=.22), those with low spatial abilities (ChatGPT group: mean 70.20, SD 10.71; control group: mean 55.41, SD 13.31; P<.001) were more positively influenced by ChatGPT. CONCLUSIONS ChatGPT has performed outstandingly in assisting dental skill learning, and the study supports the integration of ChatGPT into skills teaching and provides new ideas for modernizing skill teaching. TRIAL REGISTRATION ClinicalTrials.gov NCT06942130;https://clinicaltrials.gov/study/NCT06942130.
Affiliation(s)
- Siyu Huang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Chang Wen
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Center for Orthodontics and Pediatric Dentistry at Optics Valley Branch, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Xueying Bai
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Sihong Li
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Shuining Wang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Xiaoxuan Wang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Department of Periodontology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Dong Yang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Department of Periodontology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
3
Inoue T, Sawamura S, Sera T, Takenaka T, Kohiyama K, Nagai T. ChatGPT and Occupational Therapy: A Study of Generated Program Feasibility. Cureus 2025;17:e83761. PMID: 40491651. PMCID: PMC12146438. DOI: 10.7759/cureus.83761.
Abstract
Introduction As the application of large language models (LLMs) in the medical field advances, the potential for creating occupational therapy (OT) programs remains unexplored. This study aimed to clarify the ability of Generative Pre-trained Transformer (GPT; OpenAI, San Francisco, CA, USA) to create OT programs. Methods Based on five case reports of patients with stroke and concomitant psychological symptoms, GPT was instructed to create OT programs. Five occupational therapists (OTRs) evaluated the generated programs and quantified the degree of agreement with the programs created by OTRs. Results The programs generated by GPT showed a low degree of agreement with the programs created by OTRs in all cases, with a rating of two points or less. The content was found to be general, lacking in specificity and individuality, and insufficiently specialized. The scores for all programs generated by GPT were 2/4 or lower. Discussion While GPT has difficulty creating OT programs based on patients' life backgrounds and specialized knowledge, it showed the potential to be used in some processes of OT program creation. This is likely due to a lack of pretraining and limitations in information.
Affiliation(s)
- Tadatoshi Inoue
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
- Shogo Sawamura
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
- Tatsuya Sera
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
- Takahiro Takenaka
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
- Kengo Kohiyama
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
- Takashi Nagai
- Department of Rehabilitation, Heisei College of Health Sciences, Gifu, JPN
4
Rezazadeh H, Mahani AM, Salajegheh M. Insights Into the Future: Assessing Medical Students' Artificial Intelligence Readiness - A Cross-Sectional Study at Kerman University of Medical Sciences (2022). Health Sci Rep 2025;8:e70870. PMID: 40432697. PMCID: PMC12106343. DOI: 10.1002/hsr2.70870.
Abstract
Background Artificial intelligence (AI) has recently advanced in medicine globally, transforming healthcare delivery and medical education. While AI integration into medical curricula is gaining momentum worldwide, research on medical students' preparedness remains limited, particularly in developing countries. This paper aims to investigate the readiness of medical students at the Kerman University of Medical Sciences to employ AI in medicine in 2022. Methods This cross-sectional research was carried out by distributing the validated 20-item Medical Artificial Intelligence Readiness Scale for Medical Students (MAIRS-MS) among 360 medical students, with a response rate of 94% (n = 340). The MAIRS-MS assessed four domains, including cognition (8 items), ability (7 items), vision (2 items), and ethics (3 items), using a 5-point Likert scale. Data analysis was conducted by descriptive statistics and independent sample t-tests in SPSS v24.0, considering p < 0.05 significant. Results Participants demonstrated below-average readiness scores across all domains: ability (M = 21.88 ± 6.74, 62.5% of the maximum possible score), cognition (M = 20.30 ± 7.04, 50.8%), ethics (M = 10.94 ± 3.04, 72.9%), and vision (M = 6.09 ± 1.94, 60.9%). The total mean readiness score was 59.21 ± 16.12 (59.2% of the maximum). The highest and lowest-rated items were "value of AI in education" (3.96 ± 1.18) and "explaining AI system training" (2.10 ± 1.01), respectively. No significant differences were found across demographic factors (p > 0.05). Conclusion Iranian medical students currently show limited readiness for AI integration in healthcare practice. Therefore, the study recommends: (1) implementing structured introductory AI courses in medical curricula, focusing particularly on technical fundamentals and practical applications, and (2) developing hands-on training programs that combine AI concepts with clinical scenarios. These findings provide valuable insights for curriculum development and educational policy in medical education.
Affiliation(s)
- Hossein Rezazadeh
- Student Committee of Medical Education Development, Education Development Center, Kerman University of Medical Sciences, Kerman, Iran
- Ali Madadi Mahani
- Student Committee of Medical Education Development, Education Development Center, Kerman University of Medical Sciences, Kerman, Iran
- Mahla Salajegheh
- Department of Medical Education, Medical Education Development Center, Kerman University of Medical Sciences, Kerman, Iran
5
Sridhar GR, Gumpeny L. Prospects and perils of ChatGPT in diabetes. World J Diabetes 2025;16:98408. PMID: 40093292. PMCID: PMC11885976. DOI: 10.4239/wjd.v16.i3.98408.
Abstract
ChatGPT, a popular large language model developed by OpenAI, has the potential to transform the management of diabetes mellitus. It is a conversational artificial intelligence model trained on extensive datasets, although not specifically health-related. The development and core components of ChatGPT include neural networks and machine learning. Since the current model is not yet developed on diabetes-related datasets, it has limitations such as the risk of inaccuracies and the need for human supervision. Nevertheless, it has the potential to aid in patient engagement, medical education, and clinical decision support. In diabetes management, it can contribute to patient education, personalized dietary guidelines, and providing emotional support. Specifically, it is being tested in clinical scenarios such as assessment of obesity, screening for diabetic retinopathy, and provision of guidelines for the management of diabetic ketoacidosis. Ethical and legal considerations are essential before ChatGPT can be integrated into healthcare. Potential concerns relate to data privacy, accuracy of responses, and maintenance of the patient-doctor relationship. Ultimately, while ChatGPT and large language models hold immense potential to revolutionize diabetes care, one needs to weigh their limitations, ethical implications, and the need for human supervision. The integration promises a future of proactive, personalized, and patient-centric care in diabetes management.
Affiliation(s)
- Gumpeny R Sridhar
- Department of Endocrinology and Diabetes, Endocrine and Diabetes Centre, Visakhapatnam 530002, Andhra Pradesh, India
- Lakshmi Gumpeny
- Department of Internal Medicine, Gayatri Vidya Parishad Institute of Healthcare & Medical Technology, Visakhapatnam 530048, Andhra Pradesh, India
6
García-Rudolph A, Sanchez-Pinsach D, Caridad Fernandez M, Cunyat S, Opisso E, Hernandez-Pena E. How Chatbots Respond to NCLEX-RN Practice Questions: Assessment of Google Gemini, GPT-3.5, and GPT-4. Nurs Educ Perspect 2025;46:E18-E20. PMID: 39692545. DOI: 10.1097/01.nep.0000000000001364.
Abstract
ABSTRACT ChatGPT often "hallucinates" or misleads, underscoring the need for formal validation at the professional level for reliable use in nursing education. We evaluated two free chatbots (Google Gemini and GPT-3.5) and a commercial version (GPT-4) on 250 standardized questions from a simulated nursing licensure exam, which closely matches the content and complexity of the actual exam. Gemini achieved 73.2 percent (183/250), GPT-3.5 achieved 72 percent (180/250), and GPT-4 reached a notably higher performance with 92.4 percent (231/250). GPT-4 exhibited its highest error rate (13.3%) in the psychosocial integrity category.
Affiliation(s)
- Alejandro García-Rudolph
- About the Authors: Alejandro García-Rudolph, PhD; David Sanchez-Pinsach, PhD; Mira Caridad Fernandez, MSc; Sandra Cunyat, MSc; Eloy Opisso, PhD; and Elena Hernandez-Pena, MSc, are faculty, Institut Guttmann Hospital de Neurorehabilitació, Barcelona, Spain. The authors are grateful to Olga Araujo of the Institut Guttmann-Documentation Office for her support in accessing the literature. For more information, contact Dr. Alejandro García-Rudolph at
7
Ruta MR, Gaidici T, Irwin C, Lifshitz J. ChatGPT for Univariate Statistics: Validation of AI-Assisted Data Analysis in Healthcare Research. J Med Internet Res 2025;27:e63550. PMID: 39919289. PMCID: PMC11845875. DOI: 10.2196/63550.
Abstract
BACKGROUND ChatGPT, a conversational artificial intelligence developed by OpenAI, has rapidly become an invaluable tool for researchers. With the recent integration of Python code interpretation into the ChatGPT environment, there has been a significant increase in the potential utility of ChatGPT as a research tool, particularly in terms of data analysis applications. OBJECTIVE This study aimed to assess ChatGPT as a data analysis tool and provide researchers with a framework for applying ChatGPT to data management tasks, descriptive statistics, and inferential statistics. METHODS A subset of the National Inpatient Sample was extracted. Data analysis trials were divided into data processing, categorization, and tabulation, as well as descriptive and inferential statistics. For data processing, categorization, and tabulation assessments, ChatGPT was prompted to reclassify variables, subset variables, and present data, respectively. Descriptive statistics assessments included mean, SD, median, and IQR calculations. Inferential statistics assessments were conducted at varying levels of prompt specificity ("Basic," "Intermediate," and "Advanced"). Specific tests included chi-square, Pearson correlation, independent 2-sample t test, 1-way ANOVA, Fisher exact, Spearman correlation, Mann-Whitney U test, and Kruskal-Wallis H test. Outcomes from consecutive prompt-based trials were assessed against expected statistical values calculated in Python (Python Software Foundation), SAS (SAS Institute), and RStudio (Posit PBC). RESULTS ChatGPT accurately performed data processing, categorization, and tabulation across all trials. For descriptive statistics, it provided accurate means, SDs, medians, and IQRs across all trials. Inferential statistics accuracy against expected statistical values varied with prompt specificity: 32.5% accuracy for "Basic" prompts, 81.3% for "Intermediate" prompts, and 92.5% for "Advanced" prompts. CONCLUSIONS ChatGPT shows promise as a tool for exploratory data analysis, particularly for researchers with some statistical knowledge and limited programming expertise. However, its application requires careful prompt construction and human oversight to ensure accuracy. As a supplementary tool, ChatGPT can enhance data analysis efficiency and broaden research accessibility.
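For orientation, the sketch below shows how reference values for a few of the inferential tests named in this abstract could be computed in Python with SciPy. It is not the study's code; the small data frame and column names are invented for illustration.

```python
# Minimal sketch (not the study's code): computing reference values for a few of
# the inferential tests named above with SciPy. Data and column names are invented.
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age":  [34, 51, 47, 62, 29, 55, 41, 38],
    "sex":  ["F", "M", "F", "M", "F", "M", "M", "F"],
    "died": [0, 1, 0, 1, 0, 1, 0, 0],
})

# Pearson correlation between two numeric variables
r, p_r = stats.pearsonr(df["age"], df["died"])

# Independent two-sample t test of age by sex
t, p_t = stats.ttest_ind(df.loc[df.sex == "F", "age"],
                         df.loc[df.sex == "M", "age"])

# Chi-square test of independence on a 2x2 contingency table
chi2, p_c, dof, _ = stats.chi2_contingency(pd.crosstab(df["sex"], df["died"]))

print(f"Pearson r={r:.3f} (p={p_r:.3f}); t={t:.3f} (p={p_t:.3f}); "
      f"chi2={chi2:.3f} (p={p_c:.3f})")
```

Values obtained this way (here in Python, or equivalently in SAS or R) serve as the ground truth against which chatbot-generated statistics can be checked.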
Affiliation(s)
- Michael R Ruta
- University of Arizona College of Medicine - Phoenix, Phoenix, AZ, United States
- Tony Gaidici
- University of Arizona College of Medicine - Phoenix, Phoenix, AZ, United States
- Chase Irwin
- University of Arizona College of Medicine - Phoenix, Phoenix, AZ, United States
- Jonathan Lifshitz
- University of Arizona College of Medicine - Phoenix, Phoenix, AZ, United States
8
Burisch C, Bellary A, Breuckmann F, Ehlers J, Thal SC, Sellmann T, Gödde D. ChatGPT-4 Performance on German Continuing Medical Education-Friend or Foe (Trick or Treat)? Protocol for a Randomized Controlled Trial. JMIR Res Protoc 2025;14:e63887. PMID: 39913914. PMCID: PMC11843049. DOI: 10.2196/63887.
Abstract
BACKGROUND The increasing development and spread of artificial and assistive intelligence is opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aimed to determine whether medical laypersons can successfully conduct training courses specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4. This study aims to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. OBJECTIVE Using this approach, we wanted to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. METHODS We designed a randomized controlled trial, in which adult high school students attempt to solve CME tests across six medical specialties in three study arms in total with 18 CME training courses per study arm under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1-α=.95/α=.05; test power of 1-β=.95; P<.05). The study was registered at open scientific framework. RESULTS As of October 2024, the acquisition of data and students to participate in the trial is ongoing. Upon analysis of our acquired data, we predict our findings to be ready for publication as soon as early 2025. CONCLUSIONS We aim to prove that the advances in AI, especially LLMs such as ChatGPT-4 have considerable effects on medical laypersons' ability to successfully pass CME tests. The implications that this holds on how the concept of continuous medical education requires reevaluation are yet to be contemplated. TRIAL REGISTRATION OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) PRR1-10.2196/63887.
Affiliation(s)
- Christian Burisch
- State of North Rhine-Westphalia, Regional Government Düsseldorf, Leibniz-Gymnasium, Essen, Germany
- Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Abhav Bellary
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Frank Breuckmann
- Department of Cardiology, Pneumology, Neurology and Intensive Care Medicine, Klinik Kitzinger Land, Kitzingen, Germany
- Department of Cardiology and Vascular Medicine, West German Heart and Vascular Center Essen, University Duisburg-Essen, Essen, Germany
- Jan Ehlers
- Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Serge C Thal
- Department of Anesthesiology, HELIOS University Hospital, Wuppertal, Germany
- Department of Anaesthesiology I, Witten-Herdecke University, Witten, Germany
- Timur Sellmann
- Department of Anaesthesiology I, Witten-Herdecke University, Witten, Germany
- Department of Anesthesiology and Intensive Care Medicine, Evangelisches Krankenhaus Hospital, BETHESDA zu Duisburg, Duisburg, Germany
- Daniel Gödde
- Department of Pathology and Molecular Pathology, HELIOS University Hospital Wuppertal, University Witten/Herdecke, Witten, Germany
9
Schrager S, Seehusen DA, Sexton S, Richardson CR, Neher J, Pimlott N, Bowman MA, Rodríguez J, Morley CP, Li L, Dera JD. Use of AI in Family Medicine Publications: A Joint Editorial From Journal Editors. Ann Fam Med 2025;23:1-4. PMID: 39805694. PMCID: PMC11772029. DOI: 10.1370/afm.240575.
Affiliation(s)
- Sarina Schrager
- Department of Family Medicine and Community Health, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Dean A Seehusen
- Department of Family and Community Medicine, Medical College of Georgia, Augusta University, Augusta, Georgia
- Sumi Sexton
- Georgetown University School of Medicine, Washington, DC
- Jon Neher
- Valley Family Medicine Residency Program, Renton, Washington
- Nicholas Pimlott
- Department of Family and Community Medicine, University of Toronto, Toronto, Ontario, Canada
- José Rodríguez
- Department of Family and Preventive Medicine, Spencer Fox Eccles School of Medicine, University of Utah Health, Salt Lake City, Utah
- Christopher P Morley
- Department of Public Health and Preventive Medicine and Family Medicine, SUNY Upstate Medical University, Syracuse, New York
- Li Li
- Department of Family Medicine, University of Virginia, Charlottesville, Virginia
10
García-Rudolph A, Sanchez-Pinsach D, Opisso E. Evaluating AI Models: Performance Validation Using Formal Multiple-Choice Questions in Neuropsychology. Arch Clin Neuropsychol 2025;40:150-155. PMID: 39231527. DOI: 10.1093/arclin/acae068.
Abstract
High-quality and accessible education is crucial for advancing neuropsychology. A recent study identified key barriers to board certification in clinical neuropsychology, such as time constraints and insufficient specialized knowledge. To address these challenges, this study explored the capabilities of advanced Artificial Intelligence (AI) language models, GPT-3.5 (free-version) and GPT-4.0 (under-subscription version), by evaluating their performance on 300 American Board of Professional Psychology in Clinical Neuropsychology-like questions. The results indicate that GPT-4.0 achieved a higher accuracy rate of 80.0% compared to GPT-3.5's 65.7%. In the "Assessment" category, GPT-4.0 demonstrated a notable improvement with an accuracy rate of 73.4% compared to GPT-3.5's 58.6% (p = 0.012). The "Assessment" category, which comprised 128 questions and exhibited the highest error rate by both AI models, was analyzed. A thematic analysis of the 26 incorrectly answered questions revealed 8 main themes and 17 specific codes, highlighting significant gaps in areas such as "Neurodegenerative Diseases" and "Neuropsychological Testing and Interpretation."
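As a rough illustration of how the model-to-model accuracy gap reported above could be tested, the sketch below compares two accuracy proportions on the 128 "Assessment" items with a chi-square test. This is an assumed approach, not the authors' analysis, and the counts are reconstructed from the reported percentages.

```python
# Illustrative sketch only (assumed approach, not the authors' analysis):
# comparing GPT-4.0 vs GPT-3.5 accuracy on the 128 "Assessment" items.
from scipy.stats import chi2_contingency

n_items = 128
correct_gpt4  = round(0.734 * n_items)   # ~94 correct answers
correct_gpt35 = round(0.586 * n_items)   # ~75 correct answers

table = [
    [correct_gpt4,  n_items - correct_gpt4],
    [correct_gpt35, n_items - correct_gpt35],
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")  # p in the vicinity of the reported 0.012
```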
Affiliation(s)
- Alejandro García-Rudolph
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
- David Sanchez-Pinsach
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
- Eloy Opisso
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
11
Schrager S, Seehusen DA, Sexton SM, Richardson C, Neher J, Pimlott N, Bowman M, Rodríguez JE, Morley CP, Li L, DomDera J. Use of AI in family medicine publications: a joint editorial from journal editors. Fam Med Community Health 2025;13:e003238. PMID: 39805700. PMCID: PMC11752016. DOI: 10.1136/fmch-2024-003238.
Affiliation(s)
- Sarina Schrager
- Family Medicine, Society of Teachers of Family Medicine (STFM), Leawood, Kansas, USA
- Department of Family Medicine and Community Health, School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin, USA
- Dean A Seehusen
- American Board of Family Medicine, Lexington, Kentucky, USA
- Department of Family and Community Medicine, Medical College of Georgia, Augusta University, Augusta, Georgia, USA
- Sumi M Sexton
- American Academy of Family Physicians, Leawood, Kansas, USA
- Department of Family Medicine, Georgetown University School of Medicine, Washington, District of Columbia, USA
- Caroline Richardson
- Annals of Family Medicine, Providence, Rhode Island, USA
- Alpert Medical School, Brown University, Providence, Rhode Island, USA
- Jon Neher
- FPIN, Columbia, Missouri, USA
- Valley Medical Center FMR, Renton, Washington, USA
- Nicholas Pimlott
- Canadian Family Physician, Mississauga, Ontario, Canada
- Department of Family and Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Marjorie Bowman
- American Board of Family Medicine, Lexington, Kentucky, USA
- Veterans Health Administration, Washington, District of Columbia, USA
- José E Rodríguez
- Family Medicine, Society of Teachers of Family Medicine (STFM), Leawood, Kansas, USA
- Family and Preventive Medicine, Spencer Fox Eccles School of Medicine, University of Utah Health, Salt Lake City, Utah, USA
- Christopher P Morley
- PRiMER, San Francisco, California, USA
- Departments of Public Health, Preventive Medicine and Family Medicine, SUNY Upstate Medical University, New York, New York, USA
- Li Li
- Family Medicine and Community Health, BMJ, London, UK
- Department of Family Medicine, University of Virginia, Charlottesville, Virginia, USA
- James DomDera
- FPM, New York, New York, USA
- Pioneer Physicians Network, Uniontown, Pennsylvania, USA
12
Schrager S, Seehusen DA, Sexton SM, Richardson CR, Neher JO, Pimlott N, Bowman MA, Rodríguez JE, Morley CP, Li L, Dera JD. Use of AI in Family Medicine Publications: A Joint Editorial From Journal Editors. PRiMER (Leawood, Kan.) 2025;9:3. PMID: 39906880. PMCID: PMC11789701. DOI: 10.22454/primer.2025.889328.
Affiliation(s)
- Sarina Schrager
- Editor in Chief, Family Medicine | Department of Family Medicine and Community Health, School of Medicine and Public Health, University of Wisconsin, Madison, WI
- Dean A Seehusen
- Deputy Editor, Journal of the American Board of Family Medicine | Department of Family and Community Medicine, Medical College of Georgia, Augusta University
- Sumi M Sexton
- Editor in Chief, American Family Physician and FP Essentials | Department of Family Medicine, Georgetown University School of Medicine
- Caroline R Richardson
- Editor in Chief, Annals of Family Medicine | Alpert Medical School, Brown University
- Jon O Neher
- Editor in Chief, Evidence-Based Practice | University of Washington/Valley Medical Center FMR
- Nicholas Pimlott
- Scientific Editor, Canadian Family Physician | Department of Family and Community Medicine, University of Toronto
- Marjorie A Bowman
- Editor in Chief, Journal of the American Board of Family Medicine | Veterans Health Administration
- José E Rodríguez
- Deputy Editor, Family Medicine | Family and Preventive Medicine, Spencer Fox Eccles School of Medicine, University of Utah Health
- Christopher P Morley
- Editor in Chief, PRiMER | Departments of Public Health & Preventive Medicine and Family Medicine, SUNY Upstate Medical University
- Li Li
- Editor in Chief, Family Medicine and Community Health | Department of Family Medicine, University of Virginia
13
Schrager SB, Seehusen DA, Sexton SM, Richardson CR, Neher J, Pimlott N, Bowman MA, Rodriguez J, Morley CP, Li L, Dera JD. Use of artificial intelligence in family medicine publications: Joint statement from journal editors. Can Fam Physician 2025;71:10-12. PMID: 39843184. PMCID: PMC11753286. DOI: 10.46747/cfp.710110.
Affiliation(s)
- Dean A Seehusen
- Deputy Editor of the Journal of the American Board of Family Medicine
- Jon Neher
- Editor-in-Chief of Evidence-Based Practice by Family Physicians Inquiries Network
- Li Li
- Editor-in-Chief of Family Medicine and Community Health
14
García-Rudolph A, Sanchez-Pinsach D, Opisso E, Soler MD. Exploring new educational approaches in neuropathic pain: assessing accuracy and consistency of artificial intelligence responses from GPT-3.5 and GPT-4. Pain Med 2025;26:48-50. PMID: 39254649. DOI: 10.1093/pm/pnae094.
Affiliation(s)
- Alejandro García-Rudolph
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
- David Sanchez-Pinsach
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
- Eloy Opisso
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
- Maria Dolors Soler
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
15
Pellegrino R, Federico A, Gravina AG. Conversational LLM Chatbot ChatGPT-4 for Colonoscopy Boston Bowel Preparation Scoring: An Artificial Intelligence-to-Head Concordance Analysis. Diagnostics (Basel) 2024;14:2537. PMID: 39594203. PMCID: PMC11593257. DOI: 10.3390/diagnostics14222537.
Abstract
BACKGROUND/OBJECTIVES To date, no studies have evaluated Chat Generative Pre-Trained Transformer (ChatGPT) as a large language model chatbot in optical applications for digestive endoscopy images. This study aimed to weigh the performance of ChatGPT-4 in assessing bowel preparation (BP) quality for colonoscopy. METHODS ChatGPT-4 analysed 663 anonymised endoscopic images, scoring each according to the Boston BP scale (BBPS). Expert physicians scored the same images subsequently. RESULTS ChatGPT-4 deemed 369 frames (62.9%) to be adequately prepared (i.e., BBPS > 1) compared to 524 frames (89.3%) assessed by human assessors. The agreement was slight (κ: 0.099, p = 0.0001). The raw human BBPS score was higher at 3 (2-3) than that of ChatGPT-4 at 2 (1-3), demonstrating moderate concordance (W: 0.554, p = 0.036). CONCLUSIONS ChatGPT-4 demonstrates some potential in assessing BP on colonoscopy images, but further refinement is still needed.
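A minimal sketch of the two agreement statistics mentioned in this abstract (Cohen's kappa on the binary adequacy call and Kendall's W on the raw Boston scores) is shown below, assuming Python with scikit-learn and SciPy; the per-frame scores are invented placeholders, not the study's data.

```python
# Illustrative sketch only: agreement statistics of the kind reported above.
# The per-frame scores below are invented placeholders, not the study's data.
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import cohen_kappa_score

gpt_scores   = np.array([2, 1, 3, 2, 0, 3, 2, 1])   # ChatGPT-4 BBPS per frame
human_scores = np.array([3, 2, 3, 3, 1, 3, 2, 2])   # expert BBPS per frame

# Cohen's kappa on the binary "adequate preparation" call (BBPS > 1)
kappa = cohen_kappa_score(gpt_scores > 1, human_scores > 1)

# Kendall's W (coefficient of concordance) on the raw scores, two raters,
# without tie correction -- sufficient for illustration.
ratings = np.vstack([gpt_scores, human_scores])      # shape: (raters, frames)
m, n = ratings.shape
ranks = np.apply_along_axis(rankdata, 1, ratings)    # rank frames within each rater
rank_sums = ranks.sum(axis=0)
S = ((rank_sums - rank_sums.mean()) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))

print(f"Cohen kappa={kappa:.3f}, Kendall W={W:.3f}")
```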
Affiliation(s)
- Raffaele Pellegrino
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via L. de Crecchio, 80138 Naples, Italy
16
Dergaa I, Ben Saad H, Glenn JM, Ben Aissa M, Taheri M, Swed S, Guelmami N, Chamari K. A thorough examination of ChatGPT-3.5 potential applications in medical writing: A preliminary study. Medicine (Baltimore) 2024;103:e39757. PMID: 39465713. PMCID: PMC11460921. DOI: 10.1097/md.0000000000039757.
Abstract
Effective communication of scientific knowledge plays a crucial role in the advancement of medical research and health care. Technological advancements have introduced large language models such as Chat Generative Pre-Trained Transformer (ChatGPT), powered by artificial intelligence (AI), which has already shown promise in revolutionizing medical writing. This study aimed to conduct a detailed evaluation of ChatGPT-3.5's role in enhancing various aspects of medical writing. From May 10 to 12, 2023, the authors engaged in a series of interactions with ChatGPT-3.5 to evaluate its effectiveness in various tasks, particularly its application to medical writing, including vocabulary enhancement, text rewriting for plagiarism prevention, hypothesis generation, keyword generation, title generation, article summarization, simplification of medical jargon, transforming text from informal to scientific and data interpretation. The exploration of ChatGPT's functionalities in medical writing revealed its potential in enhancing various aspects of the writing process, demonstrating its efficiency in improving vocabulary usage, suggesting alternative phrasing, and providing grammar enhancements. While the results indicate the effectiveness of ChatGPT (version 3.5), the presence of certain imperfections highlights the current indispensability of human intervention to refine and validate outputs, ensuring accuracy and relevance in medical settings. The integration of AI into medical writing shows significant potential for improving clarity, efficiency, and reliability. This evaluation highlights both the benefits and limitations of using ChatGPT-3.5, emphasizing its ability to enhance vocabulary, prevent plagiarism, generate hypotheses, suggest keywords, summarize articles, simplify medical jargon, and transform informal text into an academic format. However, AI tools should not replace human expertise. It is crucial for medical professionals to ensure thorough human review and validation to maintain the accuracy and relevance of the content in case they eventually use AI as a supplementary resource in medical writing. Accepting this mutually symbiotic partnership holds the promise of improving medical research and patient outcomes, and it sets the stage for the fusion of AI and human knowledge to produce a novel approach to medical assessment. Thus, while AI can streamline certain tasks, experienced medical writers and researchers must perform final reviews to uphold high standards in medical communications.
Affiliation(s)
- Ismail Dergaa
- Department of Preventative Health, Primary Health Care Corporation (PHCC), Doha, Qatar
- Helmi Ben Saad
- Farhat HACHED Hospital, Service of Physiology and Functional Explorations, University of Sousse, Sousse, Tunisia
- Heart Failure (LR12SP09) Research Laboratory, Farhat HACHED Hospital, University of Sousse, Sousse, Tunisia
- Faculty of Medicine of Sousse, Laboratory of Physiology, University of Sousse, Sousse, Tunisia
- Jordan M. Glenn
- Department of Health, Exercise Science Research Center Human Performance and Recreation, University of Arkansas, Fayetteville, AR
- Mohamed Ben Aissa
- Department of Human and Social Sciences, Higher Institute of Sport and Physical Education of Kef, University of Jendouba, Jendouba, Tunisia
- Morteza Taheri
- Institute of Future Studies, Imam Khomeini International University, Qazvin, Iran
- Sarya Swed
- Faculty of Medicine, Aleppo University, Aleppo, Syria
- Noomen Guelmami
- Department of Health Sciences, Dipartimento di scienze della salute (DISSAL), Postgraduate School of Public Health, University of Genoa, Genoa, Italy
- Karim Chamari
- Naufar, Wellness and Recovery Center, Doha, Qatar
- High Institute of Sport and Physical Education, University of Manouba, Tunis, Tunisia
17
Magruder ML, Rodriguez AN, Wong JCJ, Erez O, Piuzzi NS, Scuderi GR, Slover JD, Oh JH, Schwarzkopf R, Chen AF, Iorio R, Goodman SB, Mont MA. Assessing Ability for ChatGPT to Answer Total Knee Arthroplasty-Related Questions. J Arthroplasty 2024;39:2022-2027. PMID: 38364879. DOI: 10.1016/j.arth.2024.02.023.
Abstract
BACKGROUND Artificial intelligence in the field of orthopaedics has been a topic of increasing interest and opportunity in recent years. Its applications are widespread both for physicians and patients, including use in clinical decision-making, in the operating room, and in research. In this study, we aimed to assess the quality of ChatGPT answers when asked questions related to total knee arthroplasty. METHODS ChatGPT prompts were created by turning 15 of the American Academy of Orthopaedic Surgeons Clinical Practice Guidelines into questions. An online survey was created, which included screenshots of each prompt and answers to the 15 questions. Surgeons were asked to grade ChatGPT answers from 1 to 5 based on their characteristics: (1) relevance, (2) accuracy, (3) clarity, (4) completeness, (5) evidence-based, and (6) consistency. There were 11 Adult Joint Reconstruction fellowship-trained surgeons who completed the survey. Questions were subclassified based on the subject of the prompt: (1) risk factors, (2) implant/intraoperative, and (3) pain/functional outcomes. The average and standard deviation for all answers, as well as for each subgroup, were calculated. Inter-rater reliability (IRR) was also calculated. RESULTS All answer characteristics were graded as being above average (ie, a score > 3). Relevance demonstrated the highest scores (4.43 ± 0.77) by surgeons surveyed, and consistency demonstrated the lowest scores (3.54 ± 1.10). ChatGPT prompts in the Risk Factors group demonstrated the best responses, while those in the Pain/Functional Outcome group demonstrated the lowest. The overall IRR was found to be 0.33 (poor reliability), with the highest IRR for relevance (0.43) and the lowest for evidence-based (0.28). CONCLUSIONS ChatGPT can answer questions regarding well-established clinical guidelines in total knee arthroplasty with above-average accuracy but demonstrates variable reliability. This investigation is the first step in understanding large language model artificial intelligence like ChatGPT and how well they perform in the field of arthroplasty.
Affiliation(s)
- Matthew L Magruder
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Ariel N Rodriguez
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Jason C J Wong
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Orry Erez
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, New York
- Nicolas S Piuzzi
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, Ohio
- Gil R Scuderi
- Department of Orthopaedic Surgery, Lenox Hill Hospital, Northwell Orthopaedic Institute, New York, New York
- James D Slover
- Department of Orthopaedic Surgery, Lenox Hill Hospital, Northwell Orthopaedic Institute, New York, New York
- Jason H Oh
- Department of Orthopaedic Surgery, Lenox Hill Hospital, Northwell Orthopaedic Institute, New York, New York
- Ran Schwarzkopf
- Department of Orthopaedic Surgery, NYU Langone Orthopedics, NYU Langone Health, New York, New York
- Antonia F Chen
- Department of Orthopaedic Surgery, Brigham and Women's Hospital, Boston, Massachusetts
- Richard Iorio
- Department of Orthopaedic Surgery, Brigham and Women's Hospital, Boston, Massachusetts
- Stuart B Goodman
- Department of Orthopaedic Surgery, Stanford University School of Medicine, Redwood City, California
- Michael A Mont
- Rubin Institute for Advanced Orthopedics, Sinai Hospital of Baltimore, Baltimore, Maryland
18
Gravina AG, Pellegrino R, Palladino G, Imperio G, Ventura A, Federico A. Charting new AI education in gastroenterology: Cross-sectional evaluation of ChatGPT and perplexity AI in medical residency exam. Dig Liver Dis 2024;56:1304-1311. PMID: 38503659. DOI: 10.1016/j.dld.2024.02.019.
Abstract
BACKGROUND Conversational chatbots, fueled by large language models, spark debate over their potential in education and medical career exams. There is debate in the literature about the scientific integrity of the outputs produced by these chatbots. AIMS This study evaluates ChatGPT 3.5 and Perplexity AI's cross-sectional performance in responding to questions from the 2023 Italian national residency admission exam (SSM23), comparing results and chatbots' concordance with previous years SSMs. METHODS Gastroenterology-related SSM23 questions were input into ChatGPT 3.5 and Perplexity AI, evaluating their performance in correct responses and total scores. This process was repeated with questions from the three preceding years. Additionally, chatbot concordance was assessed using Cohen's method. RESULTS In SSM23, ChatGPT 3.5 outperforms Perplexity AI with 94.11% correct responses, demonstrating consistency across years. Concordance weakened in 2023 (κ=0.203, P = 0.148), but ChatGPT consistently maintains a high standard compared to Perplexity AI. CONCLUSION ChatGPT 3.5 and Perplexity AI exhibit promise in addressing gastroenterological queries, emphasizing potential educational roles. However, their variable performance mandates cautious use as supplementary tools alongside conventional study methods. Clear guidelines are crucial for educators to balance traditional approaches and innovative systems, enhancing educational standards.
Affiliation(s)
- Antonietta Gerarda Gravina
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
- Raffaele Pellegrino
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
- Giovanna Palladino
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
- Giuseppe Imperio
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
- Andrea Ventura
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
- Alessandro Federico
- Hepatogastroenterology Division, Department of Precision Medicine, University of Campania Luigi Vanvitelli, Via Luigi de Crecchio, 80138, Naples, Italy
19
Sallam M. Bibliometric top ten healthcare-related ChatGPT publications in the first ChatGPT anniversary. Narra J 2024;4:e917. PMID: 39280327. PMCID: PMC11391998. DOI: 10.52225/narra.v4i2.917.
Abstract
Since its public release on November 30, 2022, ChatGPT has shown promising potential in diverse healthcare applications despite ethical challenges, privacy issues, and possible biases. The aim of this study was to identify and assess the most influential publications in the field of ChatGPT utility in healthcare using bibliometric analysis. The study employed an advanced search on three databases, Scopus, Web of Science, and Google Scholar, to identify ChatGPT-related records in healthcare education, research, and practice between November 27 and 30, 2023. The ranking was based on the retrieved citation count in each database. The additional alternative metrics that were evaluated included (1) Semantic Scholar highly influential citations, (2) PlumX captures, (3) PlumX mentions, (4) PlumX social media and (5) Altmetric Attention Scores (AASs). A total of 22 unique records published in 17 different scientific journals from 14 different publishers were identified in the three databases. Only two publications were in the top 10 list across the three databases. Variable publication types were identified, with the most common being editorial/commentary publications (n=8/22, 36.4%). Nine of the 22 records had corresponding authors affiliated with institutions in the United States (40.9%). The range of citation count varied per database, with the highest range identified in Google Scholar (1019-121), followed by Scopus (242-88), and Web of Science (171-23). Google Scholar citations were correlated significantly with the following metrics: Semantic Scholar highly influential citations (Spearman's correlation coefficient ρ=0.840, p<0.001), PlumX captures (ρ=0.831, p<0.001), PlumX mentions (ρ=0.609, p=0.004), and AASs (ρ=0.542, p=0.009). In conclusion, despite several acknowledged limitations, this study showed the evolving landscape of ChatGPT utility in healthcare. There is an urgent need for collaborative initiatives by all stakeholders involved to establish guidelines for ethical, transparent, and responsible use of ChatGPT in healthcare. The study revealed the correlation between citations and alternative metrics, highlighting its usefulness as a supplement to gauge the impact of publications, even in a rapidly growing research field.
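For orientation, correlations of the kind reported here (citation counts against alternative metrics) can be computed in a few lines of Python; the sketch below uses invented placeholder values, not the study's data.

```python
# Illustrative sketch only: Spearman correlation between citation counts and an
# alternative metric. The values are invented placeholders, not the study's data.
from scipy.stats import spearmanr

google_scholar_citations = [1019, 850, 600, 455, 300, 240, 190, 150, 130, 121]
plumx_captures           = [900, 700, 650, 400, 350, 200, 180, 160, 120, 100]

rho, p = spearmanr(google_scholar_citations, plumx_captures)
print(f"Spearman rho={rho:.3f}, p={p:.4f}")
```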
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Department of Translational Medicine, Faculty of Medicine, Lund University, Malmö, Sweden
20
|
Aljamaan F, Temsah MH, Altamimi I, Al-Eyadhy A, Jamal A, Alhasan K, Mesallam TA, Farahat M, Malki KH. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study. JMIR Med Inform 2024; 12:e54345. [PMID: 39083799 PMCID: PMC11325115 DOI: 10.2196/54345] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Revised: 01/05/2024] [Accepted: 07/03/2024] [Indexed: 08/02/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI) chatbots have recently come into use among health care practitioners in medical practice. Notably, the output of these AI chatbots has been found to contain varying degrees of hallucination in content and references. Such hallucinations raise doubts about the reliability of their output and about their implementation. OBJECTIVE The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations. METHODS Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and each reference's relevance to the prompt's keywords. The RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots. RESULTS Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), Elicit and SciSpace the lowest (score=1), and Perplexity an intermediate RHS (score=7). The highest degree of hallucination was observed for reference relevance to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had a significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had a significantly higher RHS when prompted with scenarios or complex-format prompts (β coefficient=0.486; P<.001). CONCLUSIONS The variation in RHS underscores the need for a robust reference evaluation tool to improve the authenticity of AI chatbots' citations. Further, the variation highlights the importance of verifying chatbot output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed RHS could contribute to ongoing efforts to enhance AI's general reliability in medical research.
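The abstract does not reproduce the exact RHS rubric, so the following Python sketch is only a plausible reading of it: one point per bibliographic item that cannot be verified, plus one point when the generated title shares no keyword with the prompt. The field names, weighting, and example reference are assumptions for illustration, not the authors' instrument.

REQUIRED_FIELDS = ("authors", "title", "journal", "year", "volume_pages", "doi")  # assumed item set

def reference_hallucination_points(reference, verified, prompt_keywords):
    """Illustrative per-reference hallucination count: unverifiable bibliographic
    items plus an off-topic penalty. Higher values mean more hallucination."""
    points = sum(1 for field in REQUIRED_FIELDS if not verified.get(field, False))
    title_words = set(reference.get("title", "").lower().split())
    if not title_words & {k.lower() for k in prompt_keywords}:
        points += 1  # generated title shares no keyword with the prompt
    return points  # 0 = fully verifiable and on-topic, 7 = entirely fabricated

# Hypothetical chatbot-generated reference with two fields that failed verification
reference = {"title": "Statin therapy and stroke prevention: a cohort study"}
verified = {"authors": True, "title": True, "journal": True,
            "year": True, "volume_pages": False, "doi": False}
print(reference_hallucination_points(reference, verified, {"statin", "stroke"}))  # -> 2

Per-prompt and per-chatbot averages, as reported in the abstract, would then be simple means over such per-reference counts.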
Collapse
Affiliation(s)
- Fadi Aljamaan
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
| | | | | | - Ayman Al-Eyadhy
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
| | - Amr Jamal
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
| | - Khalid Alhasan
- College of Medicine, King Saud University, Riyadh, Saudi Arabia
| | - Tamer A Mesallam
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
| | - Mohamed Farahat
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
| | - Khalid H Malki
- Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia
| |
Collapse
|
21
|
Liu M, Okuhara T, Chang X, Shirabe R, Nishiie Y, Okada H, Kiuchi T. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e60807. [PMID: 39052324 PMCID: PMC11310649 DOI: 10.2196/60807] [Citation(s) in RCA: 31] [Impact Index Per Article: 31.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Revised: 06/11/2024] [Accepted: 06/15/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on medical licensing examinations in different settings has shown remarkable differences. At this stage, a comprehensive understanding of the variability in ChatGPT's performance across medical licensing examinations is still lacking. OBJECTIVE In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various settings. The insights gained from this systematic review will guide educators, policymakers, and technical experts to use AI in medical education effectively and judiciously. METHODS We searched the literature published between January 1, 2022, and March 29, 2024, using query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses. RESULTS A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5's performance but did not affect GPT-4. GPT-3.5 showed no difference in performance between examinations from English-speaking and non-English-speaking countries (P=.72), but GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Any type of prompt could significantly improve GPT-3.5's (P=.03) and GPT-4's (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of both GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. CONCLUSIONS GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. TRIAL REGISTRATION PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.
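The abstract reports pooled accuracy rates with 95% CIs but does not restate the meta-analytic model, so the sketch below is a generic, self-contained illustration of inverse-variance pooling of accuracy proportions on the logit scale; the study counts are hypothetical and the fixed-effect choice is an assumption, not the review's method.

import math

def pool_accuracy(studies):
    """Fixed-effect inverse-variance pooling of accuracy proportions on the
    logit scale; returns the pooled proportion with a 95% Wald CI.
    `studies` is a list of (correct, total) tuples -- illustrative data only."""
    weights, logits = [], []
    for correct, total in studies:
        # continuity correction avoids division by zero at 0% or 100% accuracy
        p = (correct + 0.5) / (total + 1.0)
        logit = math.log(p / (1 - p))
        var = 1.0 / (correct + 0.5) + 1.0 / (total - correct + 0.5)
        logits.append(logit)
        weights.append(1.0 / var)
    pooled_logit = sum(w * l for w, l in zip(weights, logits)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    lo, hi = pooled_logit - 1.96 * se, pooled_logit + 1.96 * se
    inv = lambda x: 1.0 / (1.0 + math.exp(-x))
    return inv(pooled_logit), inv(lo), inv(hi)

# Hypothetical exam results (correct answers, total questions) for three GPT-4 runs
print(pool_accuracy([(190, 240), (88, 100), (250, 320)]))

A random-effects model would widen the interval when between-study heterogeneity is large, which is plausible given the variation across countries and examination formats described in the abstract.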
Collapse
Affiliation(s)
- Mingxin Liu
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - Tsuyoshi Okuhara
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - XinYi Chang
- Department of Industrial Engineering and Economics, School of Engineering, Tokyo Institute of Technology, Tokyo, Japan
| | - Ritsuko Shirabe
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - Yuriko Nishiie
- Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - Hiroko Okada
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| | - Takahiro Kiuchi
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
| |
Collapse
|
22
|
Fadel C, Milanova A, Suran J, Sitovs A, Kim TW, Bello A, Abay SM, Horst S, Mileva R, Amadori M, Oster E, Re G, Abdul Kadir A, Gambino G, Vercelli C. A narrative review of the phenomenon of predatory journals to create awareness among researchers in veterinary medicine. J Vet Pharmacol Ther 2024; 47:239-251. [PMID: 38654516 DOI: 10.1111/jvp.13448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2023] [Revised: 03/28/2024] [Accepted: 04/08/2024] [Indexed: 04/26/2024]
Abstract
In recent years, especially since the COVID-19 pandemic, the number of predatory journals has increased significantly. Predatory journals exploit the "open-access model" by engaging in deceptive practices such as charging high publication fees without providing the expected quality and performing insufficient or no peer review. Such behaviors undermine the integrity of scientific research and can make it difficult for researchers, particularly early-career researchers who have yet to learn the criteria that distinguish reputable journals, to identify trustworthy publication venues. Publishing in journals that do not fully meet the criteria for scientific publication is also an ethical issue. This review aimed to describe the characteristics of predatory journals, differentiate between reliable and predatory journals, investigate the reasons that lead researchers to publish in predatory journals, evaluate the negative impact of predatory publications on the scientific community, and explore future perspectives. The authors also provide considerations for researchers (particularly early-career researchers) when selecting journals for publication, explaining the role of metrics, databases, and artificial intelligence in manuscript preparation, with a specific focus on publication in veterinary medicine.
Collapse
Affiliation(s)
- Charbel Fadel
- Department of Veterinary Sciences, University of Pisa, Pisa, Italy
| | - Aneliya Milanova
- Faculty of Veterinary Medicine, Trakia University, Stara Zagora, Bulgaria
| | | | - Andrejs Sitovs
- Department of Pharmacology, Rīga Stradiņš University, Riga, Latvia
- Laboratory of Finished Dosage Forms, Rīga Stradiņš University, Riga, Latvia
| | - Tae Won Kim
- College of Veterinary Medicine, Chungnam National University, Daejeon, South Korea
| | - Abubakar Bello
- Department of Pharmacology and Toxicology, Faculty of Veterinary Medicine, Wroclaw University of Environmental and Life Sciences, Wroclaw, Poland
| | - Solomon Mequanente Abay
- Department of Pharmacology and Clinical Pharmacy, Addis Ababa University, Addis Ababa, Ethiopia
| | - Stefanie Horst
- Department of Population Health Sciences, Institute of Risk Assessment Sciences (IRAS), One Health Pharmacology, Utrecht University, Utrecht, The Netherlands
| | - Rositsa Mileva
- Faculty of Veterinary Medicine, Trakia University, Stara Zagora, Bulgaria
| | - Michela Amadori
- Department of Veterinary Sciences, University of Torino, Torino, Italy
| | - Ena Oster
- University of Zagreb, Faculty of Veterinary Medicine, Zagreb, Croatia
| | - Giovanni Re
- Department of Veterinary Sciences, University of Torino, Torino, Italy
| | - Arifah Abdul Kadir
- Department of Veterinary Preclinical Sciences, Faculty of Veterinary Medicine, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
| | - Graziana Gambino
- Department of Veterinary Sciences, University of Torino, Torino, Italy
| | - Cristina Vercelli
- Department of Veterinary Sciences, University of Torino, Torino, Italy
| |
Collapse
|
23
|
Lucas F, Mackie I, d'Onofrio G, Frater JL. Responsible use of chatbots to advance the laboratory hematology scientific literature: Challenges and opportunities. Int J Lab Hematol 2024; 46 Suppl 1:9-11. [PMID: 38639069 DOI: 10.1111/ijlh.14285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2024] [Accepted: 04/09/2024] [Indexed: 04/20/2024]
Affiliation(s)
- Fabienne Lucas
- Department of Pathology, University of Washington, Seattle, Washington, USA
| | - Ian Mackie
- Haemostasis Research Unit, University College London, London, UK
| | | | - John L Frater
- Department of Pathology and Immunology, Washington University, St Louis, Missouri, USA
| |
Collapse
|
24
|
Wu J, Ma Y, Wang J, Xiao M. The Application of ChatGPT in Medicine: A Scoping Review and Bibliometric Analysis. J Multidiscip Healthc 2024; 17:1681-1692. [PMID: 38650670 PMCID: PMC11034560 DOI: 10.2147/jmdh.s463128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Accepted: 03/25/2024] [Indexed: 04/25/2024] Open
Abstract
Purpose ChatGPT has a wide range of applications in the medical field. This review therefore aims to define the key issues and provide a comprehensive view of the literature on the application of ChatGPT in medicine. Methods This scoping review follows Arksey and O'Malley's five-stage framework. A comprehensive literature search of publications (30 November 2022 to 16 August 2023) was conducted. Six databases were searched and relevant references were systematically catalogued. Attention was focused on the general characteristics of the articles, their fields of application, and the advantages and disadvantages of using ChatGPT. Descriptive statistics and narrative synthesis methods were used for data analysis. Results Of the 3426 studies, 247 met the criteria for inclusion in this review. The largest proportion of articles (31.17%) came from the United States. Editorials (43.32%) ranked first, followed by experimental studies (11.74%). The potential applications of ChatGPT in medicine are varied, with the largest number of studies (45.75%) exploring clinical practice, including assisting with clinical decision support and providing disease information and medical advice. This was followed by medical education (27.13%) and scientific research (16.19%). In the discipline-level statistics, radiology, surgery, and dentistry topped the list. However, ChatGPT in medicine also faces issues of data privacy, inaccuracy, and plagiarism. Conclusion The application of ChatGPT in medicine spans different disciplines and general application scenarios. ChatGPT has a paradoxical nature: it offers significant advantages but at the same time raises serious concerns about its application in healthcare settings. Therefore, it is imperative to develop theoretical frameworks that not only address its widespread use in healthcare but also facilitate a comprehensive assessment. In addition, these frameworks should contribute to the development of strict and effective guidelines and regulatory measures.
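The descriptive statistics above (country of origin, publication type, application field) reduce to category counts over the 247 included articles. The short Python sketch below illustrates that tabulation; the counts are back-calculated from the percentages reported in the abstract, not taken from the review's extraction sheet.

from collections import Counter

# Application-field labels per included article, back-calculated from the
# reported proportions (45.75%, 27.13%, 16.19% of 247); "other" absorbs the rest.
fields = (["clinical practice"] * 113 + ["medical education"] * 67
          + ["scientific research"] * 40 + ["other"] * 27)  # 247 articles in total

counts = Counter(fields)
for field, count in counts.most_common():
    print(f"{field}: {count} ({count / len(fields):.2%})")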
Collapse
Affiliation(s)
- Jie Wu
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Yingzhuo Ma
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Jun Wang
- Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Mingzhao Xiao
- Department of Urology, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
Collapse
|
25
|
Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024; 13:e54704. [PMID: 38276872 PMCID: PMC10905357 DOI: 10.2196/54704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2023] [Revised: 12/18/2023] [Accepted: 01/26/2024] [Indexed: 01/27/2024] Open
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. The methodologies employed in the included records were examined carefully to identify the common pertinent themes and possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used by 2 independent raters to evaluate the included records, and Cohen κ was used to evaluate interrater reliability. RESULTS The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Interrater reliability was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). In the per-item classification, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies and guide researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary base for establishing a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.
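Interrater reliability in the checklist evaluation is summarized with Cohen κ. As a worked illustration only, the following self-contained Python sketch computes κ for two raters scoring one METRICS item across ten hypothetical records; the scores are invented and the 1-5 scale is an assumption, not taken from the study.

from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal scores to the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # chance agreement expected from each rater's marginal score distribution
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 scores from two independent raters for one checklist item
rater_1 = [4, 3, 5, 2, 4, 4, 3, 5, 2, 4]
rater_2 = [4, 3, 4, 2, 4, 5, 3, 5, 2, 4]
print(round(cohen_kappa(rater_1, rater_2), 3))  # -> 0.722

For ordinal scores such as these, a weighted κ would credit near-agreement; the unweighted form shown here is the stricter variant.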
Collapse
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
| | - Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
| | - Mohammed Sallam
- Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
| |
Collapse
|