Validating AI models on datasets distinct from the training set is crucial to assess the generalizability of the model. Although all of the included studies conducted internal validation, using methods such as k-fold cross-validation or random splitting of the data into training and test sets, only three studies performed external validation.28,29,35
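The sketch below contrasts these two forms of evaluation: internal validation by k-fold cross-validation within a development cohort versus a single evaluation on an independently collected external cohort. It is a minimal, generic illustration; the synthetic features, labels, and the logistic-regression stand-in for an imaging model are assumptions and are not drawn from any of the included studies.

```python
# Minimal sketch: internal validation (k-fold cross-validation) versus
# external validation on an independently collected cohort.
# Classifier, features, and labels are placeholders for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_internal = rng.normal(size=(200, 10))    # development cohort features
y_internal = rng.integers(0, 2, size=200)  # development cohort labels
X_external = rng.normal(size=(80, 10))     # independent external cohort
y_external = rng.integers(0, 2, size=80)

model = LogisticRegression(max_iter=1000)

# Internal validation: 5-fold cross-validation within the development cohort.
internal_auc = cross_val_score(model, X_internal, y_internal,
                               cv=5, scoring="roc_auc")
print(f"Internal (5-fold CV) AUC: {internal_auc.mean():.2f}")

# External validation: train on the full development cohort,
# then evaluate once on the untouched external cohort.
model.fit(X_internal, y_internal)
external_auc = roc_auc_score(y_external,
                             model.predict_proba(X_external)[:, 1])
print(f"External validation AUC: {external_auc:.2f}")
```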
CNNs are susceptible to overfitting if not appropriately regularized, which can lead to drastic changes in accuracy when a model is applied to data that differ from its training dataset.66
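As a generic illustration of what "appropriately regularized" can mean in practice, the fragment below applies two common forms of regularization, dropout and weight decay, to a small convolutional network in PyTorch; the architecture and hyperparameter values are arbitrary assumptions rather than those of any reviewed model.

```python
# Illustrative sketch: regularizing a small CNN with dropout and weight decay.
# Layer sizes and hyperparameters are arbitrary, chosen only to make the
# regularization points concrete.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),          # dropout regularization
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SmallCNN()
# Weight decay (L2 regularization) applied through the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```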
Interestingly, Bitton et al.32 attempted to validate a prior deep learning pipeline on an independent out-of-sample dataset and reported satisfactory performance comparable to the original result.31 This pipeline first employed a computer vision approach to quantify corneal EF from AS-OCT images and then used this value to classify FECD severity. Despite the encouraging classification performance, it became evident that the same EF cutoff was not universally applicable and required recalibration for different datasets.
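To make the recalibration issue concrete, the sketch below classifies severity from a quantified EF value using a fixed cutoff and then re-derives that cutoff on a new dataset via the Youden index. The EF values, labels, two-class simplification, and recalibration method are illustrative assumptions and do not reproduce the procedure of the cited pipeline.

```python
# Illustrative sketch: classifying FECD severity from a quantified EF value with
# a fixed cutoff, then recalibrating that cutoff on a new dataset.
# All values and the Youden-index recalibration are assumptions for illustration.
import numpy as np
from sklearn.metrics import roc_curve

def classify_severity(ef: float, cutoff: float) -> str:
    """Two-class simplification: EF at or above the cutoff is called 'advanced'."""
    return "advanced" if ef >= cutoff else "mild"

# Cutoff derived on the original (development) dataset -- placeholder value.
original_cutoff = 0.10

# New dataset: quantified EF values and reference severity labels (1 = advanced).
ef_new = np.array([0.02, 0.05, 0.08, 0.12, 0.15, 0.22, 0.30, 0.04, 0.18, 0.07])
labels_new = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# Recalibrate the cutoff on the new dataset by maximizing the Youden index.
fpr, tpr, thresholds = roc_curve(labels_new, ef_new)
recalibrated_cutoff = thresholds[np.argmax(tpr - fpr)]

print(f"Original cutoff: {original_cutoff:.2f}")
print(f"Recalibrated cutoff for the new dataset: {recalibrated_cutoff:.2f}")
print(classify_severity(0.11, recalibrated_cutoff))
```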
Studies in other fields have demonstrated that even models with adequate performance on internal validation may show significant decreases in sensitivity and specificity during external validation,67 and this decline in performance was evident in the algorithm by Foo et al.28 Beyond a computer vision model for the anterior segment, external validation of AI models in ophthalmology presents several challenges. Data heterogeneity remains a key concern, as models trained on homogeneous populations often underperform on diverse ethnicities or imaging devices.68 Image variability, including field width, quality, and magnification, also significantly impacts performance across datasets.69 Standardization is another issue, given inconsistencies in data collection, imaging protocols, and annotation methods.70 Ideally, addressing these issues requires diverse training datasets and rigorous validation on larger external datasets that represent the general population.69