Application of Big Data analysis in gastrointestinal research

doi:10.3748/wjg.v25.i24.2990

Publication Name: World Journal of Gastroenterology
ISSN: 1007-9327
Publisher: Baishideng Publishing Group Inc, 7041 Koll Center Parkway, Suite 160, Pleasanton, CA 94566, USA

Review

World J Gastroenterol. Jun 28, 2019; 25(24): 2990-3008
Published online Jun 28, 2019. doi: 10.3748/wjg.v25.i24.2990

Table 1 Advantages and shortcomings of Big Data analysis (with proposed solutions)

Advantages
Clinical data readily available with minimal resources required
Can study rare exposures
Can study rare events
Can study long-term effects
Real-world data
Large sample size
Subgroup analysis
Sensitivity analysis
Interaction of different variables
Adjustment of outcome to a multitude of risk factors
Precise estimation of effect size
Reliable capture of small variations in incidence or disease flare
No selection bias if n = all
Shortcomings specific to Big Data analysis	Solution
Data validity	Cross reference with medical records in a subset of the sample
Missing data	Statistical methods to deal with missing data (e.g., multiple imputation)
	Text mining or natural language processing of unstructured data
Incomplete capture of variables or unavailability of certain diagnosis codes	Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism)
	Inclusion of a large set of measured variables
	Text mining or natural language processing of unstructured data
Privacy	De-identification of individuals
	Review of study plan by local ethics committee
Hypothesis-free predictive models	Validation in prospective studies or randomized controlled trials
Shortcomings of all observational studies, including Big Data analysis	Solution
Residual and/or unmeasured confounding	Inclusion of a large set of measured variables
	Inclusion of RCT datasets with extensive collection of data and outcomes for trial participants or linkage with other data sources
	Fulfilment of Bradford Hill criteria
Reverse causality/protopathic bias (outcome of interest leads to exposure of interest). Example: early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC	Cohort study design instead of case-control study design
	Excluding prescriptions of drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., gastric cancer)
Selection bias	Encompassing entire study population (n = all)
Indication bias (or confounding by indication/disease severity)	Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables)
	Negative control exposure
Confounding by functional status and cognitive impairment	Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching)
Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes)	Adjustment for other lifestyle factors, captured by text mining or natural language processing of unstructured data
Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design)	Landmark analysis
	Analysis using time varying covariates
Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals). Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC is detected in PPI users	Selection of an unexposed group with a similar likelihood of screening/testing
	Selection of an outcome that is likely to be diagnosed equally in exposed and control groups
	Adjustment for the surveillance rate
Access to healthcare	Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies)
Selective prescription and treatment in frail and very sick patients	PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction)

COPD: Chronic obstructive pulmonary disease; RCT: Randomized controlled trial; GC: Gastric cancer; PPI: Proton pump inhibitor; PS: Propensity score.
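
The lag-time exclusion listed in Table 1 as a solution for reverse causality/protopathic bias can be illustrated with a minimal Python sketch. It is not taken from the article: the function names `lagged_prescriptions` and `is_exposed`, the 183-day default (an approximation of the 6-mo example), and the patient dates are all hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def lagged_prescriptions(prescription_dates, index_date, lag_days=183):
    """Keep only prescriptions issued on or before the start of the lag window.

    index_date is the outcome date (e.g., GC diagnosis) or, for subjects
    without the outcome, the end of follow-up. lag_days=183 approximates
    the 6-mo example; it is an illustrative default, not a recommendation.
    """
    window_start = index_date - timedelta(days=lag_days)
    return [d for d in prescription_dates if d <= window_start]


def is_exposed(prescription_dates, index_date, lag_days=183):
    """Classify a subject as exposed only if a prescription survives the lag filter."""
    return len(lagged_prescriptions(prescription_dates, index_date, lag_days)) > 0


# Hypothetical patient: long-term PPI use plus a prescription shortly before diagnosis.
diagnosis = date(2018, 1, 15)
long_term_user = [date(2015, 1, 10), date(2017, 11, 1)]
# Hypothetical patient: PPI first prescribed shortly before diagnosis,
# possibly for early symptoms of the undiagnosed cancer.
recent_only = [date(2017, 11, 1)]

print(lagged_prescriptions(long_term_user, diagnosis))
print(is_exposed(recent_only, diagnosis))
```

In this sketch the second patient is classified as unexposed, because the only prescription falls inside the lag window and may reflect symptoms of the outcome rather than a cause of it. In practice the lag length would be a study-specific choice and is often varied in sensitivity analyses.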

Citation: Cheung KS, Leung WK, Seto WK. Application of Big Data analysis in gastrointestinal research. World J Gastroenterol 2019; 25(24): 2990-3008
URL: https://www.wjgnet.com/1007-9327/full/v25/i24/2990.htm
DOI: https://dx.doi.org/10.3748/wjg.v25.i24.2990