Copyright
©The Author(s) 2025.
World J Gastroenterol. Feb 14, 2025; 31(6): 102090
Published online Feb 14, 2025. doi: 10.3748/wjg.v31.i6.102090
Table 1 Fifteen questions on inflammatory bowel disease
Theme | Questions |
Introduction | What is IBD? |
Introduction | Why did I get IBD? |
Introduction | Does IBD run in families? |
Diagnosis | What are the common symptoms of IBD? |
Diagnosis | How is IBD diagnosed? |
Diagnosis | What diseases need to be differentiated and diagnosed from IBD? |
Treatment | What medications are used to treat IBD? |
Treatment | What are the potential side effects of common IBD medications? |
Treatment | Will IBD need medication for life long? |
Treatment | Can IBD be cured? |
Follow-up | What are the signs and symptoms of a flare-up? |
Follow-up | What dietary changes are recommended for people with IBD? |
Follow-up | What are the potential complications of IBD? |
Follow-up | Does IBD increase the risk of colon cancer? |
Follow-up | Does having IBD affect fertility? |
Table 2 Scoring criteria for medical experts' evaluation
Evaluation dimension | Score | Scoring criteria |
Accuracy | 1 | The answer contains serious errors or misleading information that may harm patients |
Accuracy | 2 | The answer contains some errors or inaccurate information, but it will not cause obvious harm to patients |
Accuracy | 3 | The information in the answer is mostly accurate, but there are a few ambiguous or uncertain statements |
Accuracy | 4 | The information in the answer is accurate, clearly stated, and without obvious errors |
Accuracy | 5 | The information in the answer is highly accurate, professionally and authoritatively stated, and fully consistent with current medical knowledge |
Completeness | 1 | The answer is very brief, missing key information, and provides little to no help for the patient's actual question |
Completeness | 2 | Although the answer mentions some relevant content, it lacks a significant amount of important information and provides limited help to the patient |
Completeness | 3 | The answer covers the main relevant content but still omits some important information, making the guidance for the patient not comprehensive enough |
Completeness | 4 | The answer covers most of the key content, and although it may omit a small amount of minor information, it is already very helpful to the patient |
Completeness | 5 | The answer is very complete, covering all key information and providing a comprehensive answer to the patient's question |
Correlation | 1 | The answer is almost completely unrelated to the patient's actual question, and the information lacks targeting |
Correlation | 2 | Although some content in the answer is related to the question, most of the information deviates from the main topic and lacks targeting |
Correlation | 3 | The main point of the answer is basically related to the patient's question, but there is a small amount of irrelevant or off-topic content |
Correlation | 4 | The answer closely follows the patient's question, and almost all content is directly related, but there may be a few pieces of irrelevant information |
Correlation | 5 | The answer is completely on-topic, and all content is highly relevant to the patient's question, making the information very targeted |
Table 3 Scoring criteria for patients' evaluation
Score | Scoring criteria |
1 | Completely do not understand, unable to understand the content of the answer |
2 | Difficult to understand, hard to understand the content of the answer, requiring further explanation |
3 | Partially understand, able to understand some of the content of the answer, but there is some confusion |
4 | Basically understand, able to understand most of the content of the answer, but still have some doubts |
5 | Fully understand, able to accurately summarize the content of the answer and resolve all doubts |
Table 4 Mean scores for answers from three large language models
Groups | Items | ChatGPT-4.0 | Gemini-1.5-Pro | Claude-3-Opus |
Expert assessment | Accuracy, mean (SD) | 4.06 (0.61) | 4.06 (0.62) | 4.02 (0.66) |
Expert assessment | Completeness, mean (SD) | 4.24 (0.64) | 4.20 (0.66) | 4.27 (0.58) |
Expert assessment | Correlation, mean (SD) | 4.57 (0.62) | 4.54 (0.66) | 4.52 (0.66) |
Patient assessment | Comprehensibility, mean (SD) | 4.02 (0.75) | 4.07 (0.75) | 4.56 (0.66) |
Objective evaluation | FRE score, mean (SD) | 32.25 (6.91) | 36.92 (8.99) | 54.44 (8.22) |
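The FRE scores in Table 4 are presumably Flesch Reading Ease values; the table itself does not expand the abbreviation, so this is an assumption. Under that assumption, the standard Flesch formula can be sketched from word, sentence, and syllable counts. The function name and the sample counts below are illustrative and are not taken from the study.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # 206.835 - 1.015 * (average sentence length) - 84.6 * (syllables per word)
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts for illustration only (not data from the study).
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(round(score, 1))
```

On the conventional Flesch scale, scores of 30 to 50 are read as difficult text and 50 to 60 as fairly difficult text.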
Table 5 Median scores for answers from three large language models
Groups | Items | ChatGPT-4.0 | Gemini-1.5-Pro | Claude-3-Opus |
Expert assessment | Accuracy, median (Q1, Q3) | 4 (4, 4) | 4 (4, 4) | 3 (3, 4) |
Expert assessment | Completeness, median (Q1, Q3) | 4 (4, 5) | 4 (4, 5) | 4 (3, 4) |
Expert assessment | Correlation, median (Q1, Q3) | 5 (4, 5) | 5 (4, 5) | 4 (3, 4) |
Patient assessment | Comprehensibility, median (Q1, Q3) | 4 (3, 5) | 4 (4, 5) | 5 (4, 5) |
Objective evaluation | FRE score, median (Q1, Q3) | 31.10 (27.50, 34.30) | 32.79 (30.53, 42.61) | 51.47 (49.82, 56.09) |
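Summaries of the kind reported in Tables 4 and 5, mean (SD) and median (Q1, Q3), can be computed from a list of 1 to 5 ratings as in the minimal sketch below. The quartile method ("inclusive"), the use of the sample standard deviation, the function name, and the sample ratings are all assumptions made for illustration; the tables do not state which software or quartile definition was used.

```python
import statistics


def summarize(scores):
    """Return mean, sample SD, and median with quartiles for a list of ratings."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1 denominator)
        "median": median,
        "q1": q1,
        "q3": q3,
    }


# Hypothetical 1-5 ratings for illustration only (not data from the study).
ratings = [3, 4, 4, 4, 5]
print(summarize(ratings))
```

Different quartile definitions can give different Q1 and Q3 values on small samples, so the choice of method matters when reproducing a table like Table 5.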
- Citation: Zhang Y, Wan XH, Kong QZ, Liu H, Liu J, Guo J, Yang XY, Zuo XL, Li YQ. Evaluating large language models as patient education tools for inflammatory bowel disease: A comparative study. World J Gastroenterol 2025; 31(6): 102090
- URL: https://www.wjgnet.com/1007-9327/full/v31/i6/102090.htm
- DOI: https://dx.doi.org/10.3748/wjg.v31.i6.102090