Copyright
©The Author(s) 2025.
World J Gastroenterol. Jan 21, 2025; 31(3): 101092
Published online Jan 21, 2025. doi: 10.3748/wjg.v31.i3.101092
Test questions by subfield | ChatGPT-3.5, correct, n (%) | ChatGPT-3.5, incorrect, n (%) | ChatGPT-4.0, correct, n (%) | ChatGPT-4.0, incorrect, n (%) | Google Gemini, correct, n (%) | Google Gemini, incorrect, n (%) |
All test questions (n) | 52 | 52 | 52 | |||
1st run | 34 (65.4) | 18 (34.6) | 43 (82.7) | 9 (17.3) | 37 (71.1) | 15 (28.9) |
2nd run | 30 (57.7) | 22 (42.3) | 41 (78.9) | 11 (21.1) | 38 (73.1) | 14 (26.9) |
3rd run | 34 (65.4) | 18 (34.6) | 42 (80.8) | 10 (19.2) | 39 (75) | 13 (25) |
Concordance among 3 runs | 41 (78.9) | 46 (88.4) | 50 (96.2) | |||
Total accuracy (%) | 62.9 | 80.8 | 73.1 | |||
Risk factors (n) | 5 | 5 | 5 | |||
1st run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
2nd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
3rd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
Concordance among 3 runs | 5 (100) | 5 (100) | 5 (100) | |||
Total accuracy (%) | 100 | 100 | 100 | |||
Clinical manifestation (n) | 7 | 7 | 7 | |||
1st run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
2nd run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
Concordance among 3 runs | 5 (71.4) | 6 (85.7) | 7 (100) | |||
Total accuracy (%) | 33.3 | 57.1 | 71.4 | |||
Diagnosis (n) | 18 | 18 | 18 | |||
1st run | 9 (50) | 9 (50) | 15 (83.3) | 3 (16.7) | 13 (72.2) | 5 (27.8) |
2nd run | 8 (44.4) | 10 (55.6) | 15 (83.3) | 3 (16.7) | 14 (77.8) | 4 (22.2) |
3rd run | 11 (61.1) | 7 (38.9) | 15 (83.3) | 3 (16.7) | 15 (83.3) | 3 (16.7) |
Concordance among 3 runs | 12 (66.7) | 16 (88.9) | 16 (88.9) | |||
Total accuracy (%) | 51.9 | 83.3 | 77.8 | |||
Treatment (n) | 11 | 11 | 11 | |||
1st run | 11 (100) | 0 (0) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
2nd run | 10 (90.9) | 1 (9.1) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
3rd run | 10 (90.9) | 1 (9.1) | 11 (100) | 0 (0) | 9 (81.9) | 2 (18.1) |
Concordance among 3 runs | 10 (90.9) | 10 (90.9) | 11 (100) | |||
Total accuracy (%) | 93.9 | 93.9 | 81.9 | |||
Prevention (n) | 7 | 7 | 7 | |||
1st run | 4 (57.1) | 3 (42.9) | 6 (85.7) | 1 (14.3) | 3 (42.9) | 4 (57.1) |
2nd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
Concordance among 3 runs | 6 (85.7) | 5 (71.4) | 7 (100) | |||
Total accuracy (%) | 47.6 | 66.7 | 42.9 | |||
Prognosis (n) | 4 | 4 | 4 | |||
1st run | 3 (75) | 1 (25) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
2nd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
3rd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
Concordance among 3 runs | 3 (75) | 4 (100) | 4 (100) | |||
Total accuracy (%) | 58.3 | 75 | 50 |
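The table does not state how the "Total accuracy (%)" rows are derived. A minimal sketch, assuming they pool correct answers over the three runs (total correct divided by 3 × the number of questions), reproduces the reported values for the "All test questions" block to within 0.1 percentage point. The function name `total_accuracy` is hypothetical and is used here only for illustration.

```python
def total_accuracy(correct_per_run, n_questions):
    """Pooled accuracy (%) across runs.

    Assumption (not stated in the table): total accuracy is the number of
    correct answers summed over all runs, divided by the number of answers
    given (n_questions per run), rounded to one decimal place.
    """
    total_correct = sum(correct_per_run)
    total_answers = n_questions * len(correct_per_run)
    return round(100 * total_correct / total_answers, 1)


# Correct-answer counts per run, copied from the "All test questions" rows (n = 52).
runs = {
    "ChatGPT-3.5": [34, 30, 34],
    "ChatGPT-4.0": [43, 41, 42],
    "Google Gemini": [37, 38, 39],
}

for model, counts in runs.items():
    print(model, total_accuracy(counts, 52))
```

Under this assumption the sketch gives 62.8 for ChatGPT-3.5 (the table reports 62.9), 80.8 for ChatGPT-4.0, and 73.1 for Google Gemini. The 0.1 difference for ChatGPT-3.5 is consistent with a rounding difference rather than a different formula.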
- Citation: Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31(3): 101092
- URL: https://www.wjgnet.com/1007-9327/full/v31/i3/101092.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i3.101092