Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study

doi:10.3748/wjg.v31.i3.101092

Journal information

Publication name: World Journal of Gastroenterology
ISSN: 1007-9327
Publisher: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Basic Study

World J Gastroenterol. Jan 21, 2025; 31(3): 101092
Published online Jan 21, 2025. doi: 10.3748/wjg.v31.i3.101092

Table 1 Quality indicators (scientific adequacy) for answers from ChatGPT-3.5, ChatGPT-4.0, and Google Gemini

Common questions	Sources of answers	Answer lengths, 1st run	Answer lengths, 2nd run	Answer lengths, 3rd run	Grades, mean	Grades, P value
Overall (mean)	ChatGPT-3.5	275	366	352	3.50	0.2
	ChatGPT-4.0	274	252	238	3.69
	Google Gemini	307	322	325	3.53
Risk factors
What are the transmission modes of hepatitis B virus?	ChatGPT-3.5	189	316	400	3.67	0.296
	ChatGPT-4.0	358	241	220	4
	Google Gemini	264	291	291	3.33
Clinical manifestation
What are the symptoms of hepatitis B infection?	ChatGPT-3.5	247	333	356	3.67	0.216
	ChatGPT-4.0	269	276	295	3.67
	Google Gemini	226	349	352	3
Diagnosis
What is the most accurate test for diagnosing Hepatitis B infection?	ChatGPT-3.5	223	341	348	3.67	0.027
	ChatGPT-4.0	307	349	280	4
	Google Gemini	281	281	281	3
Treatment
Can hepatitis B infection be cured clinically?	ChatGPT-3.5	334	357	395	3.67	0.216
	ChatGPT-4.0	271	324	264	3.67
	Google Gemini	268	360	359	3
What are the indications of antiviral therapy for patients infected with hepatitis B virus?	ChatGPT-3.5	368	367	402	3.67	0.296
	ChatGPT-4.0	351	334	296	3.33
	Google Gemini	385	384	392	3
Can patients infected with hepatitis B virus be pregnant during antiviral treatment?	ChatGPT-3.5	341	392	383	3.33	0.079
	ChatGPT-4.0	319	247	242	4
	Google Gemini	369	352	351	4
Do patients diagnosed with chronic hepatitis B during pregnancy need antiviral therapy?	ChatGPT-3.5	366	416	383	3	0.296
	ChatGPT-4.0	230	313	256	3.33
	Google Gemini	325	330	375	3.67
Can patients diagnosed with chronic hepatitis B during lactation be treated with antiviral therapy?	ChatGPT-3.5	366	419	391	3.33	0.296
	ChatGPT-4.0	245	190	218	3.67
	Google Gemini	362	328	330	4
Prevention
How long should a newborn receive the first dose of hepatitis B vaccine after birth?	ChatGPT-3.5	133	392	185	3.67	0.296
	ChatGPT-4.0	182	146	146	3.33
	Google Gemini	193	207	201	4
Can pregnant women receive hepatitis B vaccine?	ChatGPT-3.5	181	397	338	4
	ChatGPT-4.0	179	183	149	4
	Google Gemini	277	318	318	4
How often should patients with hepatitis B virus infection be reexamined?	ChatGPT-3.5	209	421	328	3	0.027
	ChatGPT-4.0	275	171	205	3.33
	Google Gemini	330	334	334	4
Prognosis
What are the complications of hepatitis B infection?	ChatGPT-3.5	343	235	305	3.33	0.216
	ChatGPT-4.0	300	245	280	4.00
	Google Gemini	405	326	318	3.33

Table 2 Performance of ChatGPT-3.5, ChatGPT-4.0 and Google Gemini on hepatitis B infection test questions by different subfields, n (%)

Test questions by subfields	ChatGPT-3.5, correct	ChatGPT-3.5, incorrect	ChatGPT-4.0, correct	ChatGPT-4.0, incorrect	Google Gemini, correct	Google Gemini, incorrect
All test questions	52		52		52
1st run	34 (65.4)	18 (34.6)	43 (82.7)	9 (17.3)	37 (71.1)	15 (28.9)
2nd run	30 (57.7)	22 (42.3)	41 (78.9)	11 (21.1)	38 (73.1)	14 (26.9)
3rd run	34 (65.4)	18 (34.6)	42 (80.8)	10 (19.2)	39 (75)	13 (25)
Concordance among 3 runs	41 (78.9)		46 (88.4)		50 (96.2)
Total accuracy (%)	62.9		80.8		73.1
Risk factors (n)	5		5		5
1st run	5 (100)	0 (0)	5 (100)	0 (0)	5 (100)	0 (0)
2nd run	5 (100)	0 (0)	5 (100)	0 (0)	5 (100)	0 (0)
3rd run	5 (100)	0 (0)	5 (100)	0 (0)	5 (100)	0 (0)
Concordance among 3 runs	5 (100)		5 (100)		5 (100)
Total accuracy (%)	100		100		100
Clinical manifestation (n)	7		7		7
1st run	2 (28.6)	5 (71.4)	4 (57.1)	3 (42.9)	5 (71.4)	2 (28.6)
2nd run	2 (28.6)	5 (71.4)	4 (57.1)	3 (42.9)	5 (71.4)	2 (28.6)
3rd run	3 (42.9)	4 (57.1)	4 (57.1)	3 (42.9)	5 (71.4)	2 (28.6)
Concordance among 3 runs	5 (71.4)		6 (85.7)		7 (100)
Total accuracy (%)	33.3		57.1		71.4
Diagnosis (n)	18		18		18
1st run	9 (50)	9 (50)	15 (83.3)	3 (16.7)	13 (72.2)	5 (27.8)
2nd run	8 (44.4)	10 (55.6)	15 (83.3)	3 (16.7)	14 (77.8)	4 (22.2)
3rd run	11 (61.1)	7 (38.9)	15 (83.3)	3 (16.7)	15 (83.3)	3 (16.7)
Concordance among 3 runs	12 (66.7)		16 (88.9)		16 (88.9)
Total accuracy (%)	51.9		83.3		77.8
Treatment (n)	11		11		11
1st run	11 (100)	0 (0)	10 (90.9)	1 (9.1)	9 (81.9)	2 (18.1)
2nd run	10 (90.9)	1 (9.1)	10 (90.9)	1 (9.1)	9 (81.9)	2 (18.1)
3rd run	10 (90.9)	1 (9.1)	11 (100)	0 (0)	9 (81.9)	2 (18.1)
Concordance among 3 runs	10 (90.9)		10 (90.9)		11 (100)
Total accuracy (%)	93.9		93.9		81.9
Prevention (n)	7		7		7
1st run	4 (57.1)	3 (42.9)	6 (85.7)	1 (14.3)	3 (42.9)	4 (57.1)
2nd run	3 (42.9)	4 (57.1)	4 (57.1)	3 (42.9)	3 (42.9)	4 (57.1)
3rd run	3 (42.9)	4 (57.1)	4 (57.1)	3 (42.9)	3 (42.9)	4 (57.1)
Concordance among 3 runs	6 (85.7)		5 (71.4)		7 (100)
Total accuracy (%)	47.6		66.7		42.9
Prognosis (n)	4		4		4
1st run	3 (75)	1 (25)	3 (75)	1 (25)	2 (50)	2 (50)
2nd run	2 (50)	2 (50)	3 (75)	1 (25)	2 (50)	2 (50)
3rd run	2 (50)	2 (50)	3 (75)	1 (25)	2 (50)	2 (50)
Concordance among 3 runs	3 (75)		4 (100)		4 (100)
Total accuracy (%)	58.3		75		50
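
The "Total accuracy (%)" rows in Table 2 are consistent with pooling the correct answers from all three runs and dividing by the total number of attempts. The following is a minimal sketch of that arithmetic, assuming this pooling is how the figures were derived; the function name `pooled_accuracy` is illustrative and does not come from the article.

```python
def pooled_accuracy(correct_per_run, n_questions):
    """Return the percentage of correct answers pooled over all runs.

    Assumption: every run poses the same n_questions questions, so the
    denominator is n_questions multiplied by the number of runs.
    """
    attempts = n_questions * len(correct_per_run)
    return round(100 * sum(correct_per_run) / attempts, 1)


# Correct answers per run, copied from the "All test questions" rows of Table 2.
chatgpt4_all = pooled_accuracy([43, 41, 42], n_questions=52)
gemini_all = pooled_accuracy([37, 38, 39], n_questions=52)

print(chatgpt4_all, gemini_all)  # 80.8 73.1
```

Both values match the corresponding "Total accuracy (%)" entries for ChatGPT-4.0 and Google Gemini, which supports the pooling assumption.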

Table 3 Comparison of readability of answers from ChatGPT-3.5 with the 8th-grade reading level, mean ± SD

Subfield	GFI	P value	FKGL	P value
Risk factors	16.73 ± 1.77	0.013	12.60 ± 1.72	0.043
Clinical manifestation	13.68 ± 2.12	0.043	10.75 ± 1.72	0.109
Diagnosis	15.46 ± 1.65	0.016	12.12 ± 1.61	0.048
Treatment	21.22 ± 1.99	< 0.001	17.22 ± 1.47	< 0.001
Prevention	18.89 ± 1.80	< 0.001	15.53 ± 1.72	< 0.001
Prognosis	18.52 ± 1.85	0.010	15.51 ± 2.17	0.027
Overall	18.93 ± 3.03	< 0.001	15.31 ± 2.67	< 0.001

GFI: Gunning Fog index; FKGL: Flesch-Kincaid grade level.
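
Both indices are computed from simple text counts and map onto United States school grade levels, which is why an 8th-grade level serves as the comparison target. The sketch below uses the standard published formulas and assumes the counts (words, sentences, syllables, complex words) are already available; the article does not state which tool produced them, and the function names and example counts are hypothetical.

```python
def gunning_fog_index(words, sentences, complex_words):
    """GFI = 0.4 * (words per sentence + percentage of complex words).

    A "complex" word is conventionally one with three or more syllables.
    """
    return 0.4 * (words / sentences + 100 * complex_words / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """FKGL = 0.39 * words per sentence + 11.8 * syllables per word - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a 100-word answer (not taken from the study).
gfi = gunning_fog_index(words=100, sentences=5, complex_words=10)
fkgl = flesch_kincaid_grade(words=100, sentences=5, syllables=150)

print(round(gfi, 2), round(fkgl, 2))  # 12.0 9.91
```

Longer sentences and a higher share of polysyllabic words raise both scores, so dense medical terminology tends to push answers above the 8th-grade target.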

Table 4 Comparison of readability of answers from ChatGPT-4.0 with the 8th-grade reading level, mean ± SD

Subfield	GFI	P value	FKGL	P value
Risk factors	14.79 ± 0.24	< 0.001	11.45 ± 0.35	0.003
Clinical manifestation	11.05 ± 0.89	0.027	9.06 ± 0.73	0.130
Diagnosis	14.40 ± 0.42	0.001	11.28 ± 0.47	< 0.001
Treatment	18.18 ± 1.45	< 0.001	14.57 ± 1.27	< 0.001
Prevention	16.49 ± 1.27	< 0.001	13.49 ± 1.09	< 0.001
Prognosis	16.10 ± 0.52	0.001	13.18 ± 0.05	< 0.001
Overall	16.39 ± 2.38	< 0.001	13.19 ± 1.96	< 0.001

GFI: Gunning Fog index; FKGL: Flesch-Kincaid grade level.

Table 5 Comparison of readability of answers from Google Gemini with the 8th-grade reading level, mean ± SD

Subfield	GFI	P value	FKGL	P value
Risk factors	14.54 ± 0.46	0.002	10.73 ± 0.16	0.001
Clinical manifestation	13.06 ± 0.42	0.002	9.81 ± 0.68	0.043
Diagnosis	17.71 ± 0.30	< 0.001	13.54 ± 0.24	< 0.001
Treatment	19.93 ± 1.44	< 0.001	15.65 ± 1.06	< 0.001
Prevention	15.63 ± 1.96	< 0.001	11.71 ± 1.82	< 0.001
Prognosis	14.81 ± 0.62	0.003	12.37 ± 0.27	0.001
Overall	17.22 ± 2.86	< 0.001	13.32 ± 2.44	< 0.001

GFI: Gunning Fog index; FKGL: Flesch-Kincaid grade level.

Citation: Li Y, Huang CK, Hu Y, Zhou XD, He C, Zhong JW. Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study. World J Gastroenterol 2025; 31(3): 101092
URL: https://www.wjgnet.com/1007-9327/full/v31/i3/101092.htm
DOI: https://dx.doi.org/10.3748/wjg.v31.i3.101092