INTRODUCTION
The use of artificial intelligence (AI) in gastroenterology has gained momentum in the past decade. This is reflected in the increasing number of publications in the field of AI in endoscopy, most of which have been centered on colonoscopy. This is understandable, as the unique role of colonoscopy in the prevention and management of colorectal cancer (CRC), together with the unmet needs in this field, has created the perfect milieu for the introduction of AI into the world of endoscopy.
CRC represents one of the leading causes of cancer-related morbidity and mortality worldwide[1,2]. Colonoscopy decreases CRC-related mortality[3,4], with a 1% increase in adenoma detection rate (ADR) estimated to decrease interval CRC by 3%[5]. As such, a key barrier to overcome is the adenoma miss rate (AMR), which has been estimated in a meta-analysis to be as high as 22% overall, with a higher AMR when diminutive adenomas are considered[6]. Another unmet need in colonoscopy is accuracy in the optical diagnosis of colonic polyps in relation to their actual histology. Up to 90% of lesions detected on colonoscopy consist of diminutive (≤ 5 mm) and small (6-9 mm) polyps, with progression rates to advanced adenomas or CRC postulated to be low based on evidence from available studies[7]. It is therefore no surprise that most of the literature to date has focused on computer-assisted detection (CADe)[8,9] and computer-assisted diagnosis (CADx)[10-12] applications in colonoscopy.
This review article evaluates the areas in colonoscopy where AI may bridge certain gaps in clinical practice. It also explores in detail the current limitations and pitfalls in the application of AI in colonoscopy, highlighting how, despite the proliferation of literature on this topic and what it promises to offer, AI may itself represent a new gap in endoscopy which clinicians need to work to bridge.
AI TERMINOLOGY IN COLONOSCOPY
What does the term AI mean in colonoscopy?
The term “artificial intelligence” was first coined by John McCarthy in 1956 at the Dartmouth Summer Research Project. In essence, it is a branch of computer science in which computer systems are designed to perform tasks which would ordinarily require human intelligence. This definition is extremely broad and often leaves clinicians confused as to what exactly the capabilities, and by inference the limitations, of AI are in their respective fields[13]. There is therefore a need to define what AI means in colonoscopy, as this is a prerequisite for meaningful discussion of its role.
Published and ongoing studies incorporating AI in the context of colonoscopy involve the machine learning (ML) domain of AI. ML refers to the use of algorithms which form predictive and descriptive models based on analysis of input data provided by investigators (the training set)[14]. These algorithms undergo multiple iterations with the goal of performing a specific task, namely arriving at a specified classification output (e.g., polyp or no polyp) when tested on an unseen set of data (the test set). In practical terms, and in the context of colonoscopy, this is achieved using either handcrafted models or deep learning (DL).
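As a minimal illustration of this training/test paradigm, the Python sketch below fits a model on a labelled training set and scores its classification output on unseen test data. The data are synthetic stand-ins for image-derived feature vectors, and all names are hypothetical rather than taken from any published CADe system.

```python
# Minimal sketch of the ML train/test paradigm, using synthetic stand-in
# data in place of real colonoscopy images.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Each row is a hypothetical feature vector summarizing one frame;
# the label is the classification output: 1 = "polyp", 0 = "no polyp".
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# The training set is analyzed to fit the model; the test set is unseen
# data held back to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
print(f"Accuracy on unseen test data: {model.score(X_test, y_test):.2f}")
```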
A useful mental model for understanding the scope of, and roles played by, AI in colonoscopy is to regard the progress made in this field as “waves”[15]. It is crucial to understand that the methods, technologies, and results from earlier AI studies do not become obsolete the moment a “better” or “faster” computer system is available, whether judged by results we as clinicians are familiar with, such as the ADR and adenomas per colonoscopy (APC), or by technical metrics we may gravitate towards, such as the processing speed of an algorithm. Rather, these “waves” continuously interact with and build on top of each other, and as a result have a strong influence on the development of later technologies. The earlier “waves” remain relevant and may sometimes harbor solutions to certain issues faced with CADe and CADx support tools, which will be discussed later in this article. Having this mental model also helps us better understand the intrinsic biases present in all forms of ML regardless of advancements made in AI, which is essential for critical appraisal of the literature surrounding AI in clinical practice.
AI terminology relevant to colonoscopy
Commonly used terms in AI which are relevant to this review article will be discussed here. This list is not meant to be exhaustive and is meant instead to highlight terms which will help the reader understand the later critiques and solutions offered in this paper.
AI can be categorized very broadly into weak (or narrow) AI and strong AI. The former refers to systems built to solve a specific problem or to perform a single task extremely well, without an emphasis on elucidating how human reasoning works. This type of AI operates within significant constraints and a limited context. The latter, also referred to as artificial general intelligence, aims to build systems which think like humans.
Features in ML refer to the set of numbers which quantitatively summarize and compactly represent the input data. For example, differences in the morphology of polyps as defined in the Paris classification[16] and pit patterns[17] can be converted into different arrays of numbers which an ML algorithm can use to generate a prediction such as “polyp” or “no polyp” in a CADe application. Conventional learning by the ML algorithm may be supervised, where training takes place on labelled datasets, or unsupervised, where commonalities are used to identify groups within the data. Supervised learning occurs on pre-established input and output pairs, enabling the ML algorithm to learn predictive mathematical models which can then map the input from unseen data onto an outcome of interest (e.g., neoplastic or hyperplastic). In contrast, unsupervised learning identifies similarities between data points by looking at the underlying structure of the data provided, with no prior knowledge of its significance.
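The contrast between the two learning paradigms can be made concrete with a short sketch. In the hypothetical example below, the same synthetic feature vectors are passed to a supervised classifier (with histology labels) and to an unsupervised clustering algorithm (without labels); the feature values are invented for illustration and do not correspond to any real endoscopic measurement.

```python
# Supervised vs unsupervised learning on the same synthetic features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0, 1, (50, 4)),  # e.g., "hyperplastic-like" feature vectors
    rng.normal(3, 1, (50, 4)),  # e.g., "neoplastic-like" feature vectors
])
labels = np.array([0] * 50 + [1] * 50)  # histology labels (output pairs)

# Supervised: learns a mapping from pre-established input/output pairs.
classifier = RandomForestClassifier(random_state=0).fit(features, labels)

# Unsupervised: groups the same data by similarity, with no labels given.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
```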
Handcrafted knowledge represents the first “wave” of AI. This consisted of knowledge-based methods where manually extracted and selected characteristics of an object, such as polyp shape and texture, are used to create mathematical models which can achieve a class or numerical output. This is labor-intensive, and as a result such models are usually implemented on small sets of data. These systems do not have the ability to learn and were of limited clinical use. DL is another form of ML where an artificial neural network (ANN) is used to perform the same task. ANNs are supervised ML models where interconnected artificial neurons form layered networks. Signals travel via weighted inputs from artificial neurons in the previous layer to the next layer, which then propagates the signal when a predefined threshold is reached, much as biological neurons do. Classification can be optimized, and the system enhanced, by adjusting the weights given to these inter-neuron connections.
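The weighted-input-and-threshold behavior described above can be reduced to a few lines. The sketch below implements a single artificial neuron with a step activation; the weights and threshold are fixed arbitrarily for illustration, whereas in a real ANN they would be adjusted during training.

```python
# A single artificial neuron: weighted inputs are summed, and the signal
# propagates only when a predefined threshold is reached.
import numpy as np

def neuron(inputs: np.ndarray, weights: np.ndarray, threshold: float) -> int:
    """Fire (return 1) when the weighted sum of inputs crosses the threshold."""
    return int(np.dot(inputs, weights) >= threshold)

# 0.2*0.5 + 0.9*0.8 = 0.82 >= 0.7, so the neuron fires.
print(neuron(np.array([0.2, 0.9]), np.array([0.5, 0.8]), threshold=0.7))
```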
Deep convolutional neural networks (DCNNs) have enabled more hidden layers to be added between the input and output layers of an ANN, a development facilitated by advancements in other areas of computer science, as this is computationally expensive. In addition, convolutional layers apply filters (sets of weights) in a systematic fashion to each overlapping part of the input data. In this manner, large numbers of filters can be applied to the training data in parallel under the constraints of the intended task, for example classification of an image as containing a polyp or not in colonoscopy, allowing information to be extracted directly from the training images to form a feature map. DCNNs usually require large amounts of labelled training data, which are derived either from public databases or from private collections at individual institutions.
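The sketch below shows what such an architecture looks like in code: a deliberately small convolutional network written in PyTorch, with two convolutional layers producing feature maps and a final linear layer yielding a two-class output. It illustrates the building blocks only; published CADe models are far deeper and are trained on large labelled image sets.

```python
# A toy DCNN illustrating convolutional feature extraction, in PyTorch.
import torch
import torch.nn as nn

class TinyPolypNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 learned filters
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper feature maps
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)  # "polyp" vs "no polyp"

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

frame = torch.randn(1, 3, 224, 224)  # one simulated RGB frame
logits = TinyPolypNet()(frame)       # raw scores for the two classes
```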
Hyperparameters in ML refer to all parameters that have been arbitrarily set by the investigator and are used to configure the model for optimal performance at a specific task or on a specific dataset. As opposed to model parameters, which are learned automatically during training of the model, hyperparameters are manually set and affect the learning process and, ultimately, the behavior of the model. This is useful in understanding the roles of (and potential biases resulting from) the optimization and training process of AI models used in colonoscopy. The training set refers to the initial dataset used to determine optimal parameters after multiple rounds or iterations of adjustment. The validation set is mostly (but not always) a different dataset on which these parameters are tested and adjusted; it is also used to optimize the hyperparameters of the model. Lastly, the test set refers to a new set of unseen data which is used to test the model and its generalizability.
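A minimal sketch of this three-way split, again on synthetic placeholder data, is shown below: the hyperparameter (here, tree depth) is chosen on the validation set, and only the untouched test set is used to estimate generalizability.

```python
# Train/validation/test split with hyperparameter selection on the
# validation set; data are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 6))
y = (X[:, 0] > 0).astype(int)

# 60% training, 20% validation, 20% held-out test.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# max_depth is a hyperparameter: set by the investigator, not learned.
best_model = max(
    (DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
     for d in (2, 5, 10)),
    key=lambda m: m.score(X_val, y_val),
)
print(f"Generalizability estimate on the test set: {best_model.score(X_test, y_test):.2f}")
```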
AI: BRIDGING THE GAP IN COLONOSCOPY
AI in the field of colonoscopy has been studied primarily for polyp detection, polyp characterization in terms of predicted histology, and for quality assurance in the performance of colonoscopy.
Polyp detection
The rate of missed polyps was mentioned earlier in the introduction. The AMR is influenced by different factors, among which the endoscopist is considered one of the major determinants[18-21]. These human biases may be due to distraction during colonoscopy, fatigue, or the inability to maintain a sustained level of alertness during withdrawal. They lead to errors in perception, where the endoscopist may miss polyps which are visible on the monitor. The role of “second readers” in colonoscopy in increasing ADR[22,23] lends support to the hypothesis that CADe may help increase APC and ADR, and decrease AMR, during colonoscopy.
At the time of writing, six randomized controlled trials (RCTs)[24-29] have evaluated the role of CADe in colonoscopy. Hassan et al[9] recently performed a systematic review and meta-analysis of five of these studies[24,25,27-29], comprising 4354 participants. The pooled ADR was significantly higher in the CADe group compared with the control group (36.6% vs 25.2%; relative risk [RR] 1.44; 95% confidence interval [CI]: 1.27-1.62; P < 0.01), with all of the included RCTs individually reporting a significant increase in ADR. APC, which is defined as the total number of adenomas found divided by the total number of colonoscopies and correlates well with ADR[30,31], was also significantly higher in the CADe group compared with the control group (0.58 vs 0.36; RR 1.70; 95%CI: 1.53-1.89; P < 0.01). The mean withdrawal time differed statistically between the CADe and control groups in this meta-analysis.
An interesting prospective study conducted by Wang et al[32] showed that the AMR was decreased with CADe. This study differed from the RCTs mentioned above in that tandem colonoscopies were performed. Patients were randomly assigned to colonoscopy with CADe or colonoscopy without CADe, followed immediately by the other procedure performed by the same endoscopist. The study showed that the adenoma and polyp miss rates were significantly lower in the CADe colonoscopy group compared with the routine colonoscopy group (13.89% vs 40.00%; P < 0.0001 and 12.98% vs 45.90%; P < 0.0001, respectively). These results were consistent across colonic segments, i.e. the AMR was significantly lower in the CADe group in the ascending, transverse, and descending colon.
Polyp characterization (optical prediction of polyp histology)
In contrast to CADe for polyp detection, CADx deals with the interpretation of polyp appearance during colonoscopy to determine the predicted histology. Polyp classification systems such as the Kudo pit pattern[17], Sano et al[33], narrow-band imaging (NBI) International Colorectal Endoscopic (NICE)[34], and Japan NBI Expert Team (JNET)[35] classifications were developed with the purpose of predicting polyp histology and severity of neoplasia to guide therapy. The use of these classification systems for optical prediction of colorectal polyp histology requires the proper equipment, structured training, and experience in clinical application. Studies have shown wide variation in the sensitivity and specificity of the NICE and JNET classifications, with most studies reporting a moderate interobserver agreement at best[36-39].
With the clinical use of CADe, the detection of diminutive polyps is likely to increase exponentially, as demonstrated in the CADe RCTs mentioned above[24,25,27-29]. Most diminutive polyps are hyperplastic in nature, with low malignant potential. The “resect and discard” and “diagnose and leave” strategies for such polyps were studied to address these issues before the emergence of AI but failed to gain traction owing to the need for better-quality training and quality assurance in the accurate optical diagnosis of colonic polyps[40-42]. The thresholds for optical biopsy technologies in high-confidence predictions established by the American Society for Gastrointestinal Endoscopy (ASGE) Preservation and Incorporation of Valuable Endoscopic Innovations (PIVI)[43] are deemed appropriate targets for CADx support tools[44]. A systematic review and meta-analysis by the ASGE[45] showed that these thresholds were met using NBI only among NBI experts, illustrating the difficulty and practical limitations of relying on these forms of imaging for endoscopists in general to achieve accurate optical diagnoses of colorectal polyps. Hence, this represents a significant clinical gap which AI has the potential to bridge in colonoscopy.
CADx is postulated to aid in this field of colorectal polyp management by using DL models to increase the accuracy of prediction of polyp histology during colonoscopy[46]. At the time of writing, there are no RCTs evaluating CADx in colonoscopy. In a study by Jin et al[10], a DCNN was trained to differentiate between adenomatous and hyperplastic diminutive colorectal polyps with an overall accuracy of 86.7%, using polyp histology as the gold standard. The system was tested on 22 endoscopists with varying expertise: novice endoscopists, colonoscopy experts with differing levels of expertise in NBI, and NBI-trained experts. The use of CADx markedly improved the accuracy of novice endoscopists in differentiating adenomatous from hyperplastic polyps, from 73.8% to 85.6% (P < 0.05), which was comparable to the baseline accuracy of NBI-trained experts (87.6%). However, in the colonoscopy expert and NBI-trained expert groups, this increase in accuracy was less impressive (83.8% to 89.0% and 87.6% to 90.0%, respectively). The overall time to diagnosis per polyp was also shorter with CADx, although this difference was not statistically significant (3.92 s vs 3.37 s; P = 0.42).
A review of CADx predictions[47] for diminutive polyp histology which included 9 studies[48-56] showed a pooled sensitivity of 93.5% (95%CI: 90.7%-95.6%) and specificity of 90.8% (95%CI: 86.3%-95.9%), with a pooled area under the curve of 0.98. This pooled analysis of diminutive polyps had a negative predictive value (NPV) of 0.91 (95%CI: 0.89-0.94). This meets the 90% or greater threshold for NPV in adenomatous histology in rectosigmoid diminutive polyps recommended by the ASGE PIVI[43] and thus would in theory support a “diagnose and leave” strategy if these applications are validated in clinical use. However, most of these studies are retrospective in nature or, when conducted prospectively, involved the use of ex vivo video or still images.
Few prospective studies on CADx in real-time colonoscopy are currently available in the literature. In a single-center, open-label, prospective study of 791 consecutive patients undergoing colonoscopy in a university hospital, Mori et al[54] evaluated the performance of CADx in a clinical setting using endocytoscopy (CF-H290ECI; Olympus Corp, Tokyo, Japan). NBI was applied to visualize the microvascular pattern, and methylene blue staining to visualize cellular structure, under these ultra-magnifying colonoscopes with 520× optical zoom capability. Of the 466 diminutive polyps found in this study, 250 were in the rectosigmoid colon. The CADx system using endocytoscopy had an NPV for diminutive rectosigmoid adenomas ranging from 93.7% to 96.4% with methylene blue staining and from 95.2% to 96.5% with NBI. This is well above the “diagnose and leave” threshold of 90% recommended by the ASGE PIVI[43] described earlier. This prospective study also provides evidence for the utilization of CADx for prediction of polyp histology in a clinical setting, where it may have an impact on real-time decisions on polyp management.
In an earlier study with a similar design by Horiuchi et al[56], CADx was evaluated with the use of autofluorescence imaging (AFI) to differentiate diminutive rectosigmoid polyps in real-time colonoscopies. The CADx system used software-based automatic color intensity analysis, which utilized AFI’s ability to differentiate polyps based on the ratio of green to red tone intensities and was tested on 258 rectosigmoid polyps in 95 patients undergoing colonoscopy. The CAD-AFI system achieved an NPV for adenomatous polyps of 93.4% (95%CI: 89.0%-96.4%), which again exceeds the 90% “diagnose and leave” threshold[43]. In addition, the NPV using CAD-AFI was comparable to that of diagnoses made by endoscopists using AFI in the study (94.9%; 95%CI: 90.8%-97.5%).
Quality assurance in colonoscopy
Quality indices such as a high cecal intubation rate and adequate withdrawal time have been studied extensively[57,58]. However, these quality indices in colonoscopy performance and reporting are not always adhered to, for a variety of reasons such as training, lack of real-time feedback, and failure of enforcement[59-61]. In an RCT of 704 patients by Gong et al[26], which used an AI system called ENDOANGEL, the withdrawal speed and time, as well as the adequacy of mucosal exposure, were monitored in real time and in an automated fashion. This resulted in a significantly longer withdrawal time in the ENDOANGEL[62] vs the control group (mean 6.38 min vs 4.76 min, respectively; P < 0.0001). This translated into an increased ADR in the ENDOANGEL group and, more significantly, this is the only RCT to date to demonstrate an AI system which can increase the rate of detection of adenomas 10 mm or larger in size (10/355 vs 1/349, respectively; odds ratio [OR] 9.50; 95%CI: 1.19-75.75; P = 0.034). Su et al[28] used a CADe tool together with an automatic quality control system (AQCS) to increase ADR and APC. The AQCS consisted of a timer on the monitor and audio prompts for the endoscopist to slow down withdrawal speed when unstable and blurry frames were displayed or when the Boston Bowel Preparation Scale (BBPS) score in a colonic segment was < 2. This study showed an improved withdrawal time (7.03 min vs 5.68 min; P < 0.001) and rate of adequate bowel preparation (87.34% vs 80.63%; P = 0.023) in the AQCS group, in addition to the aforementioned significant increases in ADR and APC.
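While the published systems are proprietary, the kind of automated check they perform can be sketched simply. The example below flags blurry frames with a standard focus measure (variance of the Laplacian) and raises a slow-down prompt when blur persists; the threshold values are arbitrary assumptions for demonstration, not parameters of ENDOANGEL or the AQCS.

```python
# Illustrative quality-control checks, loosely mimicking the audio prompts
# described in the AQCS study; thresholds are arbitrary assumptions.
import cv2
import numpy as np

BLUR_THRESHOLD = 100.0  # hypothetical cut-off; real systems learn such limits

def frame_is_blurry(frame_bgr: np.ndarray) -> bool:
    """Flag unstable/blurry frames: a low variance of the Laplacian
    means few sharp edges are present in the image."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < BLUR_THRESHOLD

def withdrawal_alert(seconds_of_blur: float, limit_s: float = 2.0) -> bool:
    """Prompt the endoscopist to slow down when blur persists too long."""
    return seconds_of_blur > limit_s

# A featureless test frame has zero edge content and is flagged as blurry.
print(frame_is_blurry(np.zeros((64, 64, 3), dtype=np.uint8)))  # True
```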
AI: A GAP NEEDING TO BE BRIDGED IN COLONOSCOPY?
While AI has emerged in the world of endoscopy with much promise, there are several significant gaps which need to be bridged before it can be routinely applied in colonoscopy in a clinical setting.
Undefined and unspecified role in clinical environment
A major gap which needs to be bridged before AI systems can be applied in routine clinical environments is their generalizability. Three of the five available CADe RCTs[25,27,28] involved senior endoscopists with extensive experience in colonoscopy. ADR is dependent on several factors, one of which is experience. A more experienced endoscopist is not only skilled in recognition, but also in scope handling and consequent mucosal exposure during withdrawal. The role of a “second reader” in previous studies[22,23] in increasing small adenoma detection rates suggests that trainees and nurses, who by inference have less “experience” than the senior endoscopist, have no issues recognizing a polyp visible on screen. In addition, as discussed above, one of the largest increments in ADR, and the only increase in the detection of adenomas larger than 10 mm, was seen in the ENDOANGEL RCT by Gong et al[26], where real-time feedback on the adequacy of mucosal exposure was studied. An obvious but less often mentioned fact is that any CADe algorithm remains completely dependent on the endoscopist to present optimal images with adequately exposed colonic mucosa in each real-time colonoscopy performed in a busy clinical setting. A polyp not visible on the screen will not be detected by a CADe tool, no matter how powerful the algorithm is[33]. This has implications for how generalizable the available data are for clinical use, as more studies involving both “high detectors” and “low detectors” are required[25,63].
Most CADe RCTs to date were conducted in single centers. Moreover, except for the study by Wang et al[27], where a second monitor was used and was visible only to an observer who reported the alerts, the remaining RCTs were non-blinded[24-26,28-29]. It is not known what impact the latter factor may have in actual clinical practice, as non-blinded endoscopists in these studies may put more effort into exposing colonic mucosa for inspection when they are under observation. This Hawthorne effect, together with the single-center experiences of most of these RCTs, also limits their generalizability to routine clinical practice. While single monitors are encouraged[44] due to presumed gaze limitations of endoscopists and the need to reduce distractions, it is the opinion of the authors that a dual-monitor setting in clinical trials plays a crucial role in achieving a double-blind and objective environment for assessing the performance of an AI system, and in bridging this gap. Furthermore, it resembles tandem colonoscopy in that the performance of the AI system can be compared directly against endoscopists of varying skill levels and experience. With this methodology, useful information such as the AMR can be determined accurately without the patient having to undergo an additional colonoscopy as in a traditional tandem study.
Another limitation to the generalizability of the published results of AI systems for polyp detection and characterization is the difference in operational environments between endoscopy suites and centers. These can vary greatly between institutions, even those located in the same country[64]. Unlike a new endoscopic method or classification system, which can be taught or standardized through training or major society guidelines, different AI algorithms have unique hardware and software requirements which must be fulfilled for technical integration into the operational environment. For instance, some may be fully integrated into the processing unit[65], while others may be web-based applications or require an additional laptop to be linked to the endoscopy stack to function. The latter may require cloud integration support, which in turn is likely to be vendor-specific and has implications for procurement and cybersecurity. This technical integration into the operational environment is key, as the development environment from which these AI systems are derived may be vastly different[66]. Most clinical trials understandably focus on clinical aspects like ADR and APC, and their outcomes will inevitably be based on these primary objectives. However, few studies have reported the technical specifications and limitations of the AI systems they are investigating, and the rare studies that do report them do so in varying detail, most of which is insufficient for interpretation and contextualization into the operational environment. Moreover, most of the published trials have been conducted in academic or expert centers and, in several instances, in the same institutions where the AI algorithm was developed, i.e. the development and operational environments are the same[3,47]. Individual institutions may have difficulty integrating these systems due to budgetary constraints, legacy systems which are incompatible with the software and hardware requirements of the AI systems, logistical limitations such as space, and established endoscopy workflows which do not cater to the introduction of an AI system.
The current scope of AI applications in colonoscopy in the literature is also largely skewed towards polyp detection, characterization, and assessment of adequacy of mucosal exposure, which is ultimately linked to ADR. When translated to clinical practice, this effectively confines the indications for which AI should be used in colonoscopy to CRC screening, or to indications where one might expect to find colorectal polyps in the process of performing a colonoscopy. All systems developed in the field of AI in colonoscopy, from handcrafted models to the most complex DCNNs, are fundamentally “weak AI”, a term used to describe AI systems designed to solve a single problem or narrow task[15]. In a clinical setting, indications for colonoscopy are widely variable, and the pre-test probability of finding a polyp may be low. An endoscopist is able to process the demographic data, clinical course, medical history, clinical condition, laboratory investigations, and concerns of the patient, and to use this information during the colonoscopy. For example, an 85-year-old patient troubled by per rectal bleeding presents a hugely different indication and clinical index of suspicion from a 50-year-old male with a family history of early CRC. In the former case, the endoscopist’s focus may be on looking for angiodysplasia, diverticular disease, or hemorrhoids as the etiology. A “strong AI” system would be able to think and adapt like a human: calibrate the weights in its layers to perform the task at hand, determine the appropriate classification output, and achieve the correct alarm settings. Current AI systems, however, will continue looking for polyps and may present a distraction to the endoscopist if used in this clinical example, prolonging the time taken for colonoscopy in an elderly patient who may have multiple co-morbidities and for whom resection of small or diminutive adenomas may have no clinical relevance, much less answer the clinical question at hand. A trainee endoscopist or an experienced nurse, on the other hand, would immediately recognize an unusual finding, such as multiple angiodysplasias or extensive diverticular disease, even without formal training in recognizing these abnormalities.
It should be noted that AI has also been studied in colonoscopy outside the context of polyp detection, characterization, and quality assurance. Endocytoscopy has been used with AI to accurately detect persistent histologic inflammation in patients with ulcerative colitis (UC) which was reproducible based on static images[67]. A separate group used a deep neural network to predict endoscopic and histologic remission in UC patients based on evaluation of static images obtained from colonoscopy with high accuracy[68]. However, studies looking at indications other than polyp detection and characterization are few and far between.
Technical biases and lack of technical knowledge among clinicians
There is significant variability and a lack of standardization in the reporting of the technical aspects of AI algorithms in clinical trials[69]. In addition, clinicians may not have the technical knowledge to critically appraise AI literature, given that this has not been a formal part of training or an emphasis in clinical practice until relatively recently. A “minimum reporting standard” on the part of investigators, and practical knowledge of terms and potential biases on the part of clinicians, are required to bridge these gaps[70-72].
A practical knowledge of commonly used terms and how AI systems are derived is necessary for the clinician to appreciate the technical biases inherent to these algorithms. While the inclusion criteria of patients in clinical trials are clearly defined, the criteria for inclusion of the input data used to train and validate the AI system may not always be included in the methodology. This is crucial, as most AI systems for CADe were tested in the same centers where they were developed[73], often because large amounts of data were readily available there for training and validation. Although the training, validation, and test datasets may be different, they could be derived by splitting a single database from one, often expert, center. The nature of the images used could be highly similar in terms of quality (e.g., no confounding fecal material or bubbles, and polyps always centered in the image) and labelling (e.g., depending on their level of skill and training, experts from different centers may mark out only the most obvious abnormal area or delineate even the most minute detail which does not look like normal colonic mucosa for sessile serrated polyps, while experts from the same center are more likely to label lesions similarly). Prevalence and variability in presentations of disease may also differ depending on the populations studied, but the sample of images used in training and validating the AI algorithm may not reflect this natural variability of disease if data from a single center are used in the development of the AI system. This is a form of selection bias, as the input data are not selected at random and hence are not fully representative of the study population in which the AI system is meant to function. This could affect the hyperparameters chosen during validation and lead to overfitting, which occurs when the derived mathematical model fits the training data too tightly; overfitting limits the generalizability of the algorithm when it is presented with new data.
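One concrete mitigation for the splitting problem described above is to partition data at the patient level rather than the image level, so that frames from the same patient (or the same polyp) never appear on both sides of the split. A minimal sketch with placeholder data is shown below; scikit-learn’s GroupShuffleSplit enforces exactly this constraint.

```python
# Patient-level splitting to avoid leakage between training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 1000
patient_ids = rng.integers(0, 120, size=n_images)  # ~120 source patients
X = rng.random((n_images, 16))                     # placeholder features
y = rng.integers(0, 2, size=n_images)              # placeholder labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient contributes images to both sides of the split.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```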
Moreover, the proportion of “positive” to “normal” images used for training is not often mentioned in the published literature. For example, in a CADe application, polyps of various shapes, sizes, and colors may be included in the training dataset to expose the AI algorithm to all possible eventualities, including images with even the subtlest polyp. However, the proportion of “normal” images used may be disproportionately low when compared with the natural prevalence of adenomas in the population. In addition, there may not be the same rigor in the selection of “normal” images for training: variations in degree of bowel preparation, bubbles, and artefacts due to the light source reflecting off normal colonic mucosa may thus not be reflected in the images supplied to the AI algorithm for training. Positive and negative predictive values are determined by the prevalence of disease, and this may result in a higher proportion of false positives per true positive detected in clinical practice, depending on how the ratio of “positive” to “normal” images used in training compares with the true prevalence of the lesion of interest (e.g., polyps) in the study population. This is a factor which needs to be adjusted for in the AI algorithm[74].
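The prevalence effect can be made explicit with a short worked example. Using illustrative sensitivity and specificity figures of 93% and 91% (chosen only because they resemble the pooled CADx estimates quoted above, not taken from any single system), the positive predictive value falls sharply when the deployed prevalence is lower than that implied by the training image mix:

```python
# Worked example: positive predictive value falls as prevalence falls,
# even with sensitivity and specificity held constant.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.50, 0.05):  # balanced training mix vs a low-prevalence setting
    print(f"prevalence {prev:.0%}: PPV = {ppv(0.93, 0.91, prev):.2f}")
# prevalence 50%: PPV = 0.91
# prevalence 5%:  PPV = 0.35
```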
A certain form of publication bias may also exist, as clinicians who wish to publish on the topic of AI will search for references almost exclusively in medical journals. For example, meta-analyses and systematic reviews on the use of AI in colonoscopy may take a very clinical slant, while publications in computer science and engineering journals which might add technical depth to the topic being discussed will not be included. Even if a search were performed for these articles, the inclusion criteria for the literature search will inevitably involve clinical endpoints like ADR and APC, and as a result will almost always exclude publications from computer science and engineering journals. The barrier to entry in medical journals for these studies is high, as editors and reviewers, who are themselves clinicians, may not have enough technical knowledge to feel comfortable accepting these articles for publication; this may be compounded by fear of a lack of interest or understanding in the readership. On the other hand, AI and ML experts may not be familiar with the clinical aspects or relevance of their research and may be unable to pitch it at a level acceptable to a medical journal and its readership. This may result in a “reinforcement bias” of sorts, where only certain types of publications from a few expert centers, revolving around common themes, are published repeatedly and in different forms in medical journals, whereas significant developments in AI and ML which may have the potential to change clinical practice are overlooked. The same technical terms specific to these publications will be mentioned repeatedly, while novel approaches and new technical terms unfamiliar to clinicians may never see publication in a medical journal. The endoscopy readership may already have been “overfitted” towards polyp detection and characterization in the endoscopy literature[75], neglecting the fact that, as mentioned, the use of AI in colonoscopy to date has utilized only an extremely limited aspect of AI and in a very narrow clinical context. Including computer science experts on editorial boards and as reviewers for medical journals may help to bridge these technical and publication biases.
Physician sentiment towards AI
Physician sentiment is a significant determinant of how quickly technologies and recommendations are deployed in a clinical setting. A recently conducted online survey among gastroenterologists in the United States showed high overall interest in CADe and a perception that it would increase ADR (85.5% and 75.8%, respectively)[76]. However, the same survey also showed that the majority of respondents felt that CADe would prolong the time taken per colonoscopy, despite evidence to the contrary[9,24,25,27-29].
Concerns about operator dependence, or “deskilling”, of the endoscopist due to reliance on CADe and CADx for detection and characterization of polyps, respectively, are also mentioned in this survey[76] and in other reviews[44,73]. Another major concern in the survey by Wadhwa et al[76] was the perceived increase in cost per procedure (75.2%). While concerns such as withdrawal time have been addressed independently in several RCTs, others such as operator dependence and cost-effectiveness have not been studied. Hence, physician sentiment may be another significant gap which needs to be bridged before AI is adopted in colonoscopy.
Medicolegal challenges and future directions
AI algorithms which utilize DL are considered “black box” models, meaning that it is almost impossible to trace the decision-making process which led to the output determined by the algorithm when faced with a specific task (e.g., polyp or no polyp in the image, hyperplastic or adenomatous). One of the major gaps in the clinical use of AI systems in colonoscopy is medicolegal liability when a misdiagnosis or missed diagnosis occurs. While a clinician’s account of events and the accompanying documentation can be held up to scrutiny, the black box nature of DL algorithms means that the root cause and mitigating factors surrounding such a case may never be elucidated or even discovered. This has ethical implications in the event of harm to a patient[77], particularly if no clear protocol exists to define how an AI system should interface with its user and what its limits are, as the error may be due to deviation from safe use of the system or to an error of the AI system itself[78].
As AI systems, like other healthcare interventions, may have unpredictable errors, this inability to explain the errors or to detect them as they occur due to their black box nature may result in a perpetuation of systemic errors with unknown clinical implications if they are scaled up rapidly for routine clinical use in all colonoscopies. It is also unknown if the liability rests with the manufacturer, the regulatory body approving its use, or the clinician interfacing with the AI system. Having a reliable and accountable post-deployment surveillance plan is perhaps one of the strategies to minimize this risk.
Lastly, while AI systems have been shown to improve various quality indices associated with colonoscopy, one should remember that they are still limited above all by our current expertise in this field. A useful illustration is the fact that there is currently no AI system capable of detecting dysplasia in UC. The availability of DCNNs with high computing power, and of hardware to support the required processing speeds, would have made this a rather simple task from an ML point of view. However, the optimal method of surveillance for dysplasia in UC and its optical features do not have the same clinical certainty as colorectal polyps in CRC screening, with resultant discrepancies in surveillance and biopsy practices[79,80]. Moreover, there is wide interobserver variability in the histological diagnosis of dysplasia in UC[81] and an inadequate understanding of its pathogenesis[82]. It is therefore understandable that there is a paucity of expertly labelled data for “dysplasia” and “non-dysplasia” controls in UC patients for the training of an ML algorithm. Similarly, other potential AI applications in colonoscopy could include localization of diverticular bleeding and an automated scoring system for adequacy of bowel preparation which incorporates the BBPS[83] and the newly validated Colon Endoscopic Bubble Scale[84]. Clinical expertise and research in these fields must progress sufficiently for an accompanying increase in standardized, labelled data to be available before such future AI systems can be trained and materialize.