Published online Jul 15, 2023. doi: 10.4251/wjgo.v15.i7.1215
Peer-review started: February 1, 2023
First decision: March 21, 2023
Revised: March 31, 2023
Accepted: May 8, 2023
Article in press: May 8, 2023
Published online: July 15, 2023
Processing time: 161 Days and 2.5 Hours
Improving early diagnosis rates of gastric cancer (GC) is of great importance for reducing GC-related deaths. This study aimed to construct a predictive model for GC by integrating single-cell sequencing data and bulk RNA sequencing (bulk RNA-seq) data to identify potential targets for GC prediction.
Identifying predictive targets for GC is an important way to reduce GC-related deaths, and this need motivated the present study.
The objective of this study was to develop a predictive model for GC by combining single-cell sequencing data and bulk RNA-seq data and to identify potential targets for predicting GC.
We downloaded GC single-cell sequencing and bulk RNA-seq datasets from the Gene Expression Omnibus and University of California at Santa Cruz databases. The single-cell sequencing data were analyzed using the Seurat package, and the bulk RNA-seq data were analyzed using the limma package. The GC prediction model was constructed using the least absolute shrinkage and selection operator (LASSO) and random forest methods. Survival analysis was conducted using the KM-PLOTTER online database.
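To illustrate how a LASSO penalty can act as a gene selector in a classifier, the following is a minimal sketch rather than the pipeline used in this study. It fits an L1-penalized logistic regression by proximal gradient descent on synthetic data. The toy data, the penalty strength `lam`, the step size, and all function names are assumptions made for illustration only.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is the L1 proximal step."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, step=0.5, n_iter=300):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += error / n
            for j in range(p):
                grad_w[j] += error * row[j] / n
        intercept -= step * grad_b  # the intercept is not penalized
        # Gradient step followed by soft-thresholding drives weak weights to 0.
        weights = [soft_threshold(w - step * g, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept


# Hypothetical toy data: 5 "genes", of which only the first two carry signal.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(300)]
y = [1 if row[0] - row[1] + rng.gauss(0.0, 0.5) > 0 else 0 for row in X]

weights, intercept = fit_l1_logistic(X, y)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", [round(w, 3) for w in weights])
print("selected feature indices:", selected)
```

Features whose weights remain nonzero are the ones the penalty retains. A larger `lam` keeps fewer features, which is the trade-off that cross-validated choices of the penalty are meant to settle.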
By analyzing single-cell RNA sequencing data from 70707 cells from GC tissue, normal gastric tissue, and chronic gastric tissue, we identified 10 different cell types and screened for genes differentially expressed between GC and normal epithelial cells. After identifying differentially expressed genes from bulk RNA sequencing data of GC and normal gastric samples, we constructed GC prediction classifiers using the LASSO and random forest methods. The LASSO classifier performed well both in validation and when verified using The Cancer Genome Atlas and Genotype-Tissue Expression datasets [area under the curve (AUC)_min = 0.988, AUC_1se = 0.994], and the random forest model also performed well on the validation set (AUC = 0.92). Genes including TIMP1, PLOD3, CKS2, TYMP, TNFRSF10B, CPNE1, GDF15, BCAP31, and CLDN7 were of significant importance in multiple GC prediction models, and KM-PLOTTER analysis showed that they are associated with GC prognosis, indicating their potential value in GC diagnosis and treatment. However, a limitation of our study is the lack of clinical sample validation for the GC prediction models.
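For readers unfamiliar with the AUC values reported above, the metric can be read as the probability that a randomly chosen tumor sample receives a higher classifier score than a randomly chosen normal sample. The sketch below computes it by direct pairwise comparison. The function name and the example labels and scores are hypothetical and are not data from this study.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive sample scores higher; ties count as half a win."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: label 1 = tumor, label 0 = normal.
example_labels = [0, 0, 1, 1]
example_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(example_labels, example_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.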
This study demonstrates that the combination of single-cell sequencing data and bulk RNA-seq data is feasible for constructing a GC prediction model.
Using single-nucleus sequencing to assist in constructing GC prediction models may lead to more reliable results, as it has advantages in identifying epithelial cells.