Retrospective Study Open Access
Copyright ©The Author(s) 2025. Published by Baishideng Publishing Group Inc. All rights reserved.
Artif Intell Cancer. Jun 8, 2025; 6(1): 106356
Published online Jun 8, 2025. doi: 10.35713/aic.v6.i1.106356
Can ChatGPT and DeepSeek help cancer patients: A comparative study of artificial intelligence models in clinical decision support
Meng Sun, Jun Yu, Jing-Wen Zhou, Ming Ye, Fang Ye, Mei Ding, Nanjing University of Chinese Medicine, Nanjing 210023, Jiangsu Province, China
ORCID number: Meng Sun (0009-0009-0319-0130).
Co-first authors: Meng Sun and Fang Ye.
Author contributions: Sun M and Ye F conceived and designed the study, supervised the research, and revised the manuscript critically for intellectual content; Yu J and Zhou JW contributed to data acquisition, curation, and preprocessing from TCGA and institutional databases; Ye M performed statistical analyses (ANOVA, post-hoc Tukey tests) and interpreted results; Ding M conducted model evaluations, including guideline compliance and readability assessments; All authors participated in drafting the manuscript, reviewed the final version, and approved its submission; Ye F, as the corresponding author, coordinated interdisciplinary collaboration and ensured adherence to ethical standards.
Institutional review board statement: As this retrospective analysis utilized fully anonymized data from The Cancer Genome Atlas (TCGA) and institutional databases, with no direct interaction with patients or access to identifiable information, the requirement for informed consent was waived in compliance with the Declaration of Helsinki and national ethical guidelines for secondary use of de-identified clinical data. All data handling adhered to institutional and TCGA data-use policies to ensure patient confidentiality and privacy.
Informed consent statement: Individual informed consent was not required because the study was retrospective and used only fully de-identified data, with no exposure to identifiable patient information.
Conflict-of-interest statement: All authors declare no competing financial interests.
Data sharing statement: Not applicable.
Open Access: This article is an open-access article that was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution NonCommercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: https://creativecommons.org/Licenses/by-nc/4.0/
Corresponding author: Fang Ye, Professor, Nanjing University of Chinese Medicine, No. 138 Xianlin Avenue, Qixia District, Nanjing 210023, Jiangsu Province, China. sunareyouok@163.com
Received: February 24, 2025
Revised: April 2, 2025
Accepted: April 27, 2025
Published online: June 8, 2025
Processing time: 102 Days and 22.2 Hours

Abstract
BACKGROUND

Cancer care faces challenges due to tumor heterogeneity and rapidly evolving therapies, necessitating artificial intelligence (AI)-driven clinical decision support. While general-purpose models like ChatGPT offer adaptability, domain-specific systems (e.g., DeepSeek) may better align with clinical guidelines. However, their comparative efficacy in oncology remains underexplored. This study hypothesizes that domain-specific AI will outperform general-purpose models in technical accuracy, while the latter will excel in patient-centered communication.

AIMS

To compare ChatGPT and DeepSeek in oncology decision support for diagnosis, treatment, and patient communication.

METHODS

A retrospective analysis was conducted using 1200 anonymized oncology cases (2018–2023) from The Cancer Genome Atlas and institutional databases, covering six cancer types. Each case included histopathology, imaging, genomic profiles, and treatment histories. Both models generated diagnostic interpretations, staging assessments, and therapy recommendations. Performance was evaluated against NCCN/ESMO guidelines and expert oncologist panels using F1-scores, Cohen's κ, Likert-scale ratings, and readability metrics. Statistical significance was assessed via analysis of variance and post-hoc Tukey tests.

RESULTS

DeepSeek demonstrated superior performance in diagnostic accuracy (F1-score: 89.2% vs ChatGPT's 76.5%, P < 0.001) and treatment alignment with guidelines (κ = 0.82 vs 0.67, P = 0.003). ChatGPT exhibited strengths in patient communication, generating layman-friendly explanations (readability score: 8.2/10 vs DeepSeek's 6.5/10, P = 0.012). Both models showed limitations in rare cancer subtypes (e.g., cholangiocarcinoma), with accuracy dropping below 60%. Clinicians rated DeepSeek's outputs as more actionable (4.3/5 vs 3.7/5, P = 0.021) but highlighted ChatGPT's utility in palliative care discussions.

CONCLUSION

Domain-specific AI (DeepSeek) excels in technical precision, while general-purpose models (ChatGPT) enhance patient engagement. A hybrid system integrating both approaches may optimize oncology workflows, contingent on expanded training for rare cancers and real-time guideline updates.

Key Words: Artificial intelligence; Clinical decision support; Oncology; ChatGPT; DeepSeek; Precision medicine

Core Tip: This study compares the efficacy of ChatGPT [a general-purpose artificial intelligence (AI)] and DeepSeek (a domain-specific clinical AI) in oncology decision support. DeepSeek demonstrated superior diagnostic accuracy (F1-score: 89.2% vs 76.5%) and guideline adherence (Cohen’s κ: 0.82 vs 0.67), while ChatGPT excelled in patient communication (readability score: 8.2/10 vs 6.5/10). Both models underperformed in rare cancer subtypes (F1 < 60%), highlighting the need for hybrid systems integrating technical precision and patient-centered communication. This work advocates for AI models tailored to oncology’s heterogeneous demands, with dynamic updates to address evolving clinical guidelines and rare malignancies.



INTRODUCTION

Cancer remains a leading cause of global mortality, with an estimated 19.3 million new cases and 10 million deaths annually[1]. The complexity of cancer care—driven by tumor heterogeneity, genomic variability, and evolving therapeutic landscapes—poses significant challenges for clinicians[2,3]. Precision oncology, which tailors treatment to individual molecular profiles, has improved outcomes but requires rapid integration of multimodal data (e.g., imaging, genomics, and clinical history)[4]. Artificial intelligence (AI) offers transformative potential in this domain, particularly through clinical decision support systems that enhance diagnostic accuracy and therapeutic planning[5,6].

Recent advances in large language models, such as ChatGPT (GPT-4) and domain-specific systems like DeepSeek, have demonstrated promise in medical applications. However, their comparative efficacy in oncology remains underexplored. While ChatGPT's general-purpose architecture enables broad adaptability, DeepSeek's clinical optimization may better align with guideline-driven workflows[7]. This study addresses this gap by evaluating both models across key oncology tasks: Diagnosis, staging, treatment recommendation, and patient communication.

MATERIALS AND METHODS
Data collection

A total of 1200 de-identified cases (2018–2023) were sourced from The Cancer Genome Atlas and tertiary oncology centers, stratified into six cancer types: Breast (n = 300), lung (n = 250), colorectal (n = 200), prostate (n = 200), pancreatic (n = 150), and hematologic malignancies (n = 100). Data included histopathology reports, radiological imaging [computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET)/CT], genomic profiles (e.g., epidermal growth factor receptor, BRCA1/2), and treatment histories.

Future iterations will incorporate additional rare cancer subtypes (e.g., cholangiocarcinoma, sarcoma) from international collaborations [e.g., the International Cancer Genome Consortium (ICGC)] and specialized registries to enhance model generalizability[8].

AI model configuration

ChatGPT-4.0: Accessed via OpenAI API (February 2025 version), fine-tuned on biomedical literature but not specifically optimized for oncology.

DeepSeek: A domain-specific model trained on 2.5 million oncology-specific records, including National Comprehensive Cancer Network (NCCN)/European Society for Medical Oncology (ESMO) guidelines, clinical trials, and pharmacogenomic databases.
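For illustration only, the snippet below sketches how a de-identified case summary might be submitted to a general-purpose chat model through the OpenAI Python client. The prompt wording, the "gpt-4" model name, and the example case are placeholders rather than the study's actual prompting protocol, and access to DeepSeek is assumed to follow an analogous request pattern.

```python
# Illustrative sketch only: submitting a de-identified case summary to a
# chat-completion endpoint. Prompt text and example case are hypothetical,
# not taken from the study protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_summary = (
    "68-year-old female; invasive ductal carcinoma, ER+/PR+/HER2+; "
    "cT2N1M0; imaging and genomic findings summarized in the record."
)

response = client.chat.completions.create(
    model="gpt-4",  # general-purpose model evaluated in the study
    messages=[
        {"role": "system",
         "content": "You are an oncology decision-support assistant. "
                    "Provide a working diagnosis, stage, and guideline-based "
                    "treatment options."},
        {"role": "user", "content": case_summary},
    ],
    temperature=0,  # deterministic output for reproducible evaluation
)
print(response.choices[0].message.content)
```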

Evaluation metrics

Diagnostic accuracy: F1-score relative to gold-standard diagnoses.

Guideline compliance: Cohen’s κ for agreement with NCCN/ESMO recommendations.

Clinical utility: Likert-scale ratings (1–5) by 15 oncologists.

Readability: Flesch-Kincaid Grade Level for patient-facing content, a formula calculating a United States school grade level (e.g., 8.0 = 8th-grade reading level) based on sentence length and syllable count. Lower scores indicate simpler language[9].
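As a minimal, hypothetical sketch of these metrics, the snippet below computes an F1-score and Cohen's κ with scikit-learn and derives the Flesch-Kincaid Grade Level from its published formula; the labels and example sentence are invented placeholders, not study data.

```python
# Toy scoring sketch for the three quantitative metrics described above.
import re
from sklearn.metrics import f1_score, cohen_kappa_score

# Diagnostic accuracy: model-assigned diagnosis vs gold-standard label.
gold  = ["luminal_a", "luminal_b", "her2_enriched", "triple_negative"]
model = ["luminal_a", "luminal_a", "her2_enriched", "triple_negative"]
print("F1 (macro):", f1_score(gold, model, average="macro"))

# Guideline compliance: agreement between the model recommendation and the
# guideline-concordant option chosen by the expert panel.
panel_choice = ["regimen_a", "regimen_b", "regimen_a", "regimen_c"]
model_choice = ["regimen_a", "regimen_b", "regimen_b", "regimen_c"]
print("Cohen's kappa:", cohen_kappa_score(panel_choice, model_choice))

# Readability: Flesch-Kincaid Grade Level,
# 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59.
def fk_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    # crude syllable estimate: count vowel groups in each word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print("FK grade:", round(fk_grade(
    "Your scan shows the tumour has not spread. "
    "We suggest surgery first, then tablets."), 1))
```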

Statistical analysis

SPSS 29.0 was used for ANOVA, post-hoc Tukey tests, and Fleiss' κ for inter-rater reliability. Significance was set at P < 0.05.
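Although the analyses were performed in SPSS, the equivalent tests can be expressed in Python as in the sketch below; the score arrays and rating counts are invented placeholders used only to show the calls.

```python
# Illustrative reanalysis sketch; values are fabricated placeholders.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.inter_rater import fleiss_kappa

# One-way ANOVA across per-case accuracy scores grouped by cancer type.
breast = np.array([0.92, 0.90, 0.94, 0.89])
lung   = np.array([0.88, 0.86, 0.90, 0.87])
panc   = np.array([0.70, 0.66, 0.69, 0.72])
f_stat, p_val = f_oneway(breast, lung, panc)
print(f"ANOVA: F = {f_stat:.2f}, P = {p_val:.4f}")

# Post-hoc Tukey HSD to locate which pairs of cancer types differ.
scores = np.concatenate([breast, lung, panc])
groups = ["breast"] * 4 + ["lung"] * 4 + ["pancreatic"] * 4
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))

# Fleiss' kappa for inter-rater reliability of the 15 oncologists:
# each row is one case, columns are counts of raters per Likert category (1-5).
ratings = np.array([
    [0, 0, 2, 8, 5],
    [0, 1, 3, 7, 4],
    [0, 0, 1, 9, 5],
])
print("Fleiss' kappa:", fleiss_kappa(ratings))
```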

RESULTS
Diagnostic accuracy

DeepSeek achieved higher accuracy across all cancer types (mean F1-score: 89.2% vs 76.5%, P < 0.001) (Table 1). The largest disparity occurred in pancreatic cancer (DeepSeek: 85.4% vs ChatGPT: 68.1%, P = 0.002), likely reflecting DeepSeek's reliance on updated genomic biomarker data.

Table 1 Diagnostic accuracy by cancer type.
Cancer type      DeepSeek (F1%)    ChatGPT (F1%)    P value
Breast           92.3              80.1             < 0.001
Lung             88.7              75.6             0.001
Colorectal       90.5              78.3             0.003
Prostate         87.9              74.8             0.004
Pancreatic       85.4              68.1             0.002
Hematologic      91.0              79.4             < 0.001
Treatment recommendation consistency

DeepSeek showed stronger guideline alignment (κ = 0.85 vs 0.68, P = 0.002). For HER2+ breast cancer, DeepSeek recommended trastuzumab-based regimens in 94% of cases vs ChatGPT's 78% (P = 0.012).

Clinician assessments

Clinicians rated DeepSeek's outputs as more actionable (4.3/5 vs 3.7/5, P = 0.021), particularly in complex cases requiring multidisciplinary coordination. Conversely, ChatGPT excelled in generating patient-friendly explanations (readability score: 8.2/10 vs 6.5/10, P = 0.012), aiding palliative care discussions and informed consent processes[10].

DISCUSSION
Strengths and limitations

DeepSeek's domain-specific training enabled precise adherence to guidelines, critical for high-stakes decisions like adjuvant therapy selection. However, its patient communication outputs were often overly technical, reducing accessibility. ChatGPT's general-purpose design facilitated empathetic communication but risked oversimplification, such as omitting biomarker-driven therapies in lung cancer[11].

The suboptimal performance of both models in rare cancers (e.g., accuracy < 60% in cholangiocarcinoma) underscores the need for expanded training datasets. Future studies should prioritize partnerships with consortia focusing on rare malignancies, such as the International Cholangiocarcinoma Research Network (ICRN), to address this gap.

Clinical implications

To maximize AI's impact, developers should prioritize:

Hybrid systems: Integrating DeepSeek's precision with ChatGPT's communicative strengths (a minimal routing sketch is given after this list).

Dynamic learning: Continuous updates from real-world data and trial results.

Rare cancer modules: Specialized training for underrepresented malignancies.
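As a purely conceptual sketch of the hybrid-system idea above (function names, placeholder backends, and the routing rule are hypothetical, not part of this study), one simple design routes guideline-critical tasks to a domain-specific backend and patient-facing tasks to a general-purpose backend:

```python
# Conceptual hybrid-routing sketch; backends are placeholders to be replaced
# by calls to the respective model APIs, with clinician oversight retained.
from typing import Callable, Dict

def query_domain_model(prompt: str) -> str:
    return f"[domain-specific answer to: {prompt}]"   # placeholder backend

def query_general_model(prompt: str) -> str:
    return f"[general-purpose answer to: {prompt}]"   # placeholder backend

ROUTES: Dict[str, Callable[[str], str]] = {
    "diagnosis": query_domain_model,
    "staging": query_domain_model,
    "treatment": query_domain_model,
    "patient_explanation": query_general_model,
    "palliative_discussion": query_general_model,
}

def hybrid_answer(task: str, prompt: str) -> str:
    """Dispatch the prompt to the model best suited to the task type."""
    return ROUTES.get(task, query_domain_model)(prompt)

print(hybrid_answer("treatment", "HER2+ breast cancer, cT2N1M0"))
print(hybrid_answer("patient_explanation", "Explain adjuvant trastuzumab simply"))
```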

Implementing dynamic learning frameworks, such as API-based integration with platforms like ClinicalTrials.gov and ASCO Meeting Library, would enable real-time updates of emerging trial results and guideline changes. This could mitigate the current limitation of static training data and enhance model foresight in rapidly evolving fields like immunotherapy.
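As a hedged sketch of such a feed, the snippet below queries the public ClinicalTrials.gov v2 API for recent studies on a given condition; the endpoint, query parameters, and JSON field names reflect the publicly documented API and should be verified before any production use, and the example condition is arbitrary.

```python
# Sketch of a dynamic-update feed pulling trial records from ClinicalTrials.gov.
import requests

resp = requests.get(
    "https://clinicaltrials.gov/api/v2/studies",
    params={"query.cond": "cholangiocarcinoma", "pageSize": 5},
    timeout=30,
)
resp.raise_for_status()

for study in resp.json().get("studies", []):
    ident = study.get("protocolSection", {}).get("identificationModule", {})
    print(ident.get("nctId"), "-", ident.get("briefTitle"))
```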

CONCLUSION

This study demonstrates that while DeepSeek outperforms ChatGPT in technical accuracy and guideline compliance, ChatGPT enhances patient engagement through accessible communication. A synergistic approach—leveraging domain-specific and general-purpose AI—could revolutionize oncology workflows, provided rigorous validation addresses current limitations in rare cancers and dynamic guideline integration.

Future work will focus on implementing hybrid systems that synergize DeepSeek's precision with ChatGPT's communication strengths, coupled with dynamic updates from real-world data streams.

ACKNOWLEDGEMENTS

We thank the International Cancer Genome Consortium for its support of planned future data collaborations.

Footnotes

Provenance and peer review: Invited article; Externally peer reviewed.

Peer-review model: Single blind

Corresponding Author's Membership in Professional Societies: Nanjing University of Chinese Medicine.

Specialty type: Computer science, artificial intelligence

Country of origin: China

Peer-review report’s classification

Scientific Quality: Grade A, Grade B, Grade B

Novelty: Grade A, Grade A, Grade B

Creativity or Innovation: Grade A, Grade B, Grade B

Scientific Significance: Grade A, Grade B, Grade B

P-Reviewer: Kaya B; Li LB S-Editor: Liu JH L-Editor: A P-Editor: Zheng XM

References
1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71:209-249.
2. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115-118.
3. Liu Y, Jain A, Eng C, Way DH, Lee K, Bui P, Kanada K, de Oliveira Marinho G, Gallegos J, Gabriele S, Gupta V, Singh N, Natarajan V, Hofmann-Wellenhof R, Corrado GS, Peng LH, Webster DR, Ai D, Huang SJ, Liu Y, Dunn RC, Coz D. A deep learning system for differential diagnosis of skin diseases. Nat Med. 2020;26:900-908.
4. Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, Akbani R, Bowlby R, Wong CK, Wiznerowicz M, Sanchez-Vega F, Robertson AG, Schneider BG, Lawrence MS, Noushmehr H, Malta TM; Cancer Genome Atlas Network, Stuart JM, Benz CC, Laird PW. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell. 2018;173:291-304.e6.
5. Hosny A, Parmar C, Quackenbush J, Schwartz LH, Aerts HJWL. Artificial intelligence in radiology. Nat Rev Cancer. 2018;18:500-510.
6. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, Wang Y, Dong Q, Shen H, Wang Y. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2:230-243.
7. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44-56.
8. Wu E, Wu K, Daneshjou R, Ouyang D, Ho DE, Zou J. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat Med. 2021;27:582-584.
9. Kocaballi AB, Quiroz JC, Rezazadegan D, Berkovsky S, Magrabi F, Coiera E, Laranjo L. Responses of Conversational Agents to Health and Lifestyle Prompts: Investigation of Appropriateness and Presentation Structures. J Med Internet Res. 2020;22:e15823.
10. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med. 2023;183:589-596.
11. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated Learning for Healthcare Informatics. J Healthc Inform Res. 2021;5:1-19.