Copyright © The Author(s) 2025.
World J Gastroenterol. Jan 21, 2025; 31(3): 101092
Published online Jan 21, 2025. doi: 10.3748/wjg.v31.i3.101092
Table 1 Quality indicators (scientific adequacy) for answers from ChatGPT-3.5, ChatGPT-4.0, and Google Gemini
Common questions | Sources of answers | Answer lengths, 1st run | Answer lengths, 2nd run | Answer lengths, 3rd run | Grades, mean | Grades, P value |
Overall (mean) | ChatGPT-3.5 | 275 | 366 | 352 | 3.50 | 0.2 |
ChatGPT-4.0 | 274 | 252 | 238 | 3.69 | ||
Google Gemini | 307 | 322 | 325 | 3.53 | ||
Risk factors | ||||||
What are the transmission modes of hepatitis B virus? | ChatGPT-3.5 | 189 | 316 | 400 | 3.67 | 0.296 |
ChatGPT-4.0 | 358 | 241 | 220 | 4 | ||
Google Gemini | 264 | 291 | 291 | 3.33 | ||
Clinical manifestation | ||||||
What are the symptoms of hepatitis B infection? | ChatGPT-3.5 | 247 | 333 | 356 | 3.67 | 0.216 |
ChatGPT-4.0 | 269 | 276 | 295 | 3.67 | ||
Google Gemini | 226 | 349 | 352 | 3 | ||
Diagnosis | ||||||
What is the most accurate test for diagnosing Hepatitis B infection? | ChatGPT-3.5 | 223 | 341 | 348 | 3.67 | 0.027 |
ChatGPT-4.0 | 307 | 349 | 280 | 4 | ||
Google Gemini | 281 | 281 | 281 | 3 | ||
Treatment | ||||||
Can hepatitis B infection be cured clinically? | ChatGPT-3.5 | 334 | 357 | 395 | 3.67 | 0.216 |
ChatGPT-4.0 | 271 | 324 | 264 | 3.67 | ||
Google Gemini | 268 | 360 | 359 | 3 | ||
What are the indications of antiviral therapy for patients infected with hepatitis B virus? | ChatGPT-3.5 | 368 | 367 | 402 | 3.67 | 0.296 |
ChatGPT-4.0 | 351 | 334 | 296 | 3.33 | ||
Google Gemini | 385 | 384 | 392 | 3 | ||
Can patients infected with hepatitis B virus be pregnant during antiviral treatment? | ChatGPT-3.5 | 341 | 392 | 383 | 3.33 | 0.079 |
ChatGPT-4.0 | 319 | 247 | 242 | 4 | ||
Google Gemini | 369 | 352 | 351 | 4 | ||
Do patients diagnosed with chronic hepatitis B during pregnancy need antiviral therapy? | ChatGPT-3.5 | 366 | 416 | 383 | 3 | 0.296 |
ChatGPT-4.0 | 230 | 313 | 256 | 3.33 | ||
Google Gemini | 325 | 330 | 375 | 3.67 | ||
Can patients diagnosed with chronic hepatitis B during lactation be treated with antiviral therapy? | ChatGPT-3.5 | 366 | 419 | 391 | 3.33 | 0.296 |
ChatGPT-4.0 | 245 | 190 | 218 | 3.67 | ||
Google Gemini | 362 | 328 | 330 | 4 | ||
Prevention | ||||||
How long should a newborn receive the first dose of hepatitis B vaccine after birth? | ChatGPT-3.5 | 133 | 392 | 185 | 3.67 | 0.296 |
ChatGPT-4.0 | 182 | 146 | 146 | 3.33 | ||
Google Gemini | 193 | 207 | 201 | 4 | ||
Can pregnant women receive hepatitis B vaccine? | ChatGPT-3.5 | 181 | 397 | 338 | 4 | |
ChatGPT-4.0 | 179 | 183 | 149 | 4 | ||
Google Gemini | 277 | 318 | 318 | 4 | ||
How often should patients with hepatitis B virus infection be reexamined? | ChatGPT-3.5 | 209 | 421 | 328 | 3 | 0.027 |
ChatGPT-4.0 | 275 | 171 | 205 | 3.33 | ||
Google Gemini | 330 | 334 | 334 | 4 | ||
Prognosis | ||||||
What are the complications of hepatitis B infection? | ChatGPT-3.5 | 343 | 235 | 305 | 3.33 | 0.216 |
ChatGPT-4.0 | 300 | 245 | 280 | 4.00 | ||
Google Gemini | 405 | 326 | 318 | 3.33 |
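The "Overall (mean)" grades in Table 1 appear to be the simple average of the twelve per-question mean grades for each model. The following is a minimal sketch of that check, assuming an unweighted mean rounded to two decimals; the variable and function names are illustrative and do not come from the study.

```python
# Per-question mean grades transcribed from Table 1, in row order
# (transmission modes first, complications last).
grades = {
    "ChatGPT-3.5": [3.67, 3.67, 3.67, 3.67, 3.67, 3.33, 3, 3.33, 3.67, 4, 3, 3.33],
    "ChatGPT-4.0": [4, 3.67, 4, 3.67, 3.33, 4, 3.33, 3.67, 3.33, 4, 3.33, 4],
    "Google Gemini": [3.33, 3, 3, 3, 3, 4, 3.67, 4, 4, 4, 4, 3.33],
}


def overall_mean(values):
    """Unweighted mean of the per-question grades, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


# Assumption: the "Overall (mean)" row is an unweighted average over questions.
overall = {model: overall_mean(values) for model, values in grades.items()}

for model, value in overall.items():
    print(f"{model}: {value:.2f}")
```

Under this assumption the recomputed values are 3.50, 3.69, and 3.53, which agree with the "Overall (mean)" rows of the table.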
Table 2 Performance of ChatGPT-3.5, ChatGPT-4.0 and Google Gemini on hepatitis B infection test questions by different subfields, n (%)
Test questions by subfields | ChatGPT-3.5, correct | ChatGPT-3.5, incorrect | ChatGPT-4.0, correct | ChatGPT-4.0, incorrect | Google Gemini, correct | Google Gemini, incorrect |
All test questions (n) | 52 | 52 | 52 | |||
1st run | 34 (65.4) | 18 (34.6) | 43 (82.7) | 9 (17.3) | 37 (71.1) | 15 (28.9) |
2nd run | 30 (57.7) | 22 (42.3) | 41 (78.9) | 11 (21.1) | 38 (73.1) | 14 (26.9) |
3rd run | 34 (65.4) | 18 (34.6) | 42 (80.8) | 10 (19.2) | 39 (75) | 13 (25) |
Concordance among 3 runs | 41 (78.9) | 46 (88.4) | 50 (96.2) | |||
Total accuracy (%) | 62.9 | 80.8 | 73.1 | |||
Risk factors (n) | 5 | 5 | 5 | |||
1st run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
2nd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
3rd run | 5 (100) | 0 (0) | 5 (100) | 0 (0) | 5 (100) | 0 (0) |
Concordance among 3 runs | 5 (100) | 5 (100) | 5 (100) | |||
Total accuracy (%) | 100 | 100 | 100 | |||
Clinical manifestation (n) | 7 | 7 | 7 | |||
1st run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
2nd run | 2 (28.6) | 5 (71.4) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 5 (71.4) | 2 (28.6) |
Concordance among 3 runs | 5 (71.4) | 6 (85.7) | 7 (100) | |||
Total accuracy (%) | 33.3 | 57.1 | 71.4 | |||
Diagnosis (n) | 18 | 18 | 18 | |||
1st run | 9 (50) | 9 (50) | 15 (83.3) | 3 (16.7) | 13 (72.2) | 5 (27.8) |
2nd run | 8 (44.4) | 10 (55.6) | 15 (83.3) | 3 (16.7) | 14 (77.8) | 4 (22.2) |
3rd run | 11 (61.1) | 7 (38.9) | 15 (83.3) | 3 (16.7) | 15 (83.3) | 3 (16.7) |
Concordance among 3 runs | 12 (66.7) | 16 (88.9) | 16 (88.9) | |||
Total accuracy (%) | 51.9 | 83.3 | 77.8 | |||
Treatment (n) | 11 | 11 | 11 | |||
1st run | 11 (100) | 0 (0) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
2nd run | 10 (90.9) | 1 (9.1) | 10 (90.9) | 1 (9.1) | 9 (81.9) | 2 (18.1) |
3rd run | 10 (90.9) | 1 (9.1) | 11 (100) | 0 (0) | 9 (81.9) | 2 (18.1) |
Concordance among 3 runs | 10 (90.9) | 10 (90.9) | 11 (100) | |||
Total accuracy (%) | 93.9 | 93.9 | 81.9 | |||
Prevention (n) | 7 | 7 | 7 | |||
1st run | 4 (57.1) | 3 (42.9) | 6 (85.7) | 1 (14.3) | 3 (42.9) | 4 (57.1) |
2nd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
3rd run | 3 (42.9) | 4 (57.1) | 4 (57.1) | 3 (42.9) | 3 (42.9) | 4 (57.1) |
Concordance among 3 runs | 6 (85.7) | 5 (71.4) | 7 (100) | |||
Total accuracy (%) | 47.6 | 66.7 | 42.9 | |||
Prognosis (n) | 4 | 4 | 4 | |||
1st run | 3 (75) | 1 (25) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
2nd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
3rd run | 2 (50) | 2 (50) | 3 (75) | 1 (25) | 2 (50) | 2 (50) |
Concordance among 3 runs | 3 (75) | 4 (100) | 4 (100) | |||
Total accuracy (%) | 58.3 | 75 | 50 |
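The "Total accuracy (%)" rows in Table 2 are consistent with pooling the correct answers over the three runs and dividing by the total number of attempts. The sketch below illustrates that arithmetic for the "All test questions" rows. The pooling rule is an assumption inferred from the numbers, not a method stated in the table, and the function name is hypothetical.

```python
# Correct-answer counts per run, transcribed from the
# "All test questions" rows of Table 2.
N_QUESTIONS = 52
correct_per_run = {
    "ChatGPT-3.5": [34, 30, 34],
    "ChatGPT-4.0": [43, 41, 42],
    "Google Gemini": [37, 38, 39],
}


def pooled_accuracy(correct_counts, n_questions):
    """Percentage of correct answers pooled over all runs (assumed definition)."""
    total_attempts = n_questions * len(correct_counts)
    return 100 * sum(correct_counts) / total_attempts


for model, counts in correct_per_run.items():
    print(f"{model}: {pooled_accuracy(counts, N_QUESTIONS):.1f}%")
```

With these inputs the pooled values are 80.8% for ChatGPT-4.0 and 73.1% for Google Gemini, as reported. For ChatGPT-3.5 the same calculation gives 62.8%, 0.1 percentage point below the reported 62.9, which may reflect a rounding difference.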
Table 3 Comparison of readability of answers from ChatGPT-3.5 with the 8th grade reading level, mean ± SD
Subfield | GFI | P value | FKGL | P value |
Risk factors | 16.73 ± 1.77 | 0.013 | 12.60 ± 1.72 | 0.043 |
Clinical manifestation | 13.68 ± 2.12 | 0.043 | 10.75 ± 1.72 | 0.109 |
Diagnosis | 15.46 ± 1.65 | 0.016 | 12.12 ± 1.61 | 0.048 |
Treatment | 21.22 ± 1.99 | < 0.001 | 17.22 ± 1.47 | < 0.001 |
Prevention | 18.89 ± 1.80 | < 0.001 | 15.53 ± 1.72 | < 0.001 |
Prognosis | 18.52 ± 1.85 | 0.010 | 15.51 ± 2.17 | 0.027 |
Overall | 18.93 ± 3.03 | < 0.001 | 15.31 ± 2.67 | < 0.001 |
Table 4 Comparison of readability of answers from ChatGPT-4.0 with the 8th grade reading level, mean ± SD
Subfield | GFI | P value | FKGL | P value |
Risk factors | 14.79 ± 0.24 | < 0.001 | 11.45 ± 0.35 | 0.003 |
Clinical manifestation | 11.05 ± 0.89 | 0.027 | 9.06 ± 0.73 | 0.130 |
Diagnosis | 14.40 ± 0.42 | 0.001 | 11.28 ± 0.47 | < 0.001 |
Treatment | 18.18 ± 1.45 | < 0.001 | 14.57 ± 1.27 | < 0.001 |
Prevention | 16.49 ± 1.27 | < 0.001 | 13.49 ± 1.09 | < 0.001 |
Prognosis | 16.10 ± 0.52 | 0.001 | 13.18 ± 0.05 | < 0.001 |
Overall | 16.39 ± 2.38 | < 0.001 | 13.19 ± 1.96 | < 0.001 |
Table 5 Comparison of readability of answers from Google Gemini with the 8th grade reading level, mean ± SD
Subfield | GFI | P value | FKGL | P value |
Risk factors | 14.54 ± 0.46 | 0.002 | 10.73 ± 0.16 | 0.001 |
Clinical manifestation | 13.06 ± 0.42 | 0.002 | 9.81 ± 0.68 | 0.043 |
Diagnosis | 17.71 ± 0.30 | < 0.001 | 13.54 ± 0.24 | < 0.001 |
Treatment | 19.93 ± 1.44 | < 0.001 | 15.65 ± 1.06 | < 0.001 |
Prevention | 15.63 ± 1.96 | < 0.001 | 11.71 ± 1.82 | < 0.001 |
Prognosis | 14.81 ± 0.62 | 0.003 | 12.37 ± 0.27 | 0.001 |
Overall | 17.22 ± 2.86 | < 0.001 | 13.32 ± 2.44 | < 0.001 |
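Tables 3 to 5 report two readability scores, GFI (Gunning Fog Index) and FKGL (Flesch-Kincaid Grade Level), both of which estimate the school grade needed to read a text. The sketch below shows the standard published formulas for the two scores, computed from pre-counted text statistics. It is an illustration only: the tables do not say which tool the authors used, and the counts in the example are hypothetical, not taken from any model answer.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Standard Gunning Fog Index; complex words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical counts for a single answer (assumed values for illustration).
words, sentences, syllables, complex_words = 300, 15, 540, 60

fkgl = flesch_kincaid_grade(words, sentences, syllables)
gfi = gunning_fog(words, sentences, complex_words)

# A score above 8 means the text reads above the 8th grade level
# that the tables use as the comparison point.
print(f"FKGL = {fkgl:.2f}, GFI = {gfi:.2f}")
```

For these hypothetical counts the formulas give an FKGL of 13.45 and a GFI of 16.00, both above the 8th grade reference level.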
- Citation: Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31(3): 101092
- URL: https://www.wjgnet.com/1007-9327/full/v31/i3/101092.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i3.101092