The present study showed that applying deep learning to color-coded corneal maps of Scheimpflug images can accurately and objectively classify eyes into the three studied categories: normal, keratoconus, and subclinical keratoconus. This classification is central when screening refractive surgery candidates to avoid the risk of postoperative ectasia.8–12
Several studies have confirmed the high accuracy of machine learning models for keratoconus screening using indices measured with Placido disk-based corneal topography or a Scheimpflug camera.14–24,26 These studies applied machine learning to topographic and numeric indices that describe corneal shape. In accordance with Arbelaez et al.,18 our data show high diagnostic accuracy for the SVM classifier, with limited performance in discriminating eyes with subclinical keratoconus. However, direct comparison with the work of Arbelaez et al.18 is not possible. Compared with such numeric indices, the color-coded corneal maps obtained by Scheimpflug cameras provide CNNs with richer spatial information, because the image is decomposed into pixel values and summarized by the convolution kernels. This allows characteristic image features to be captured in greater detail, resulting in a substantial improvement in classification performance over the SVM.
Kamiya et al.28 used transfer learning by applying the publicly available, pretrained network ResNet-18 to six color-coded AS-OCT maps (anterior elevation, anterior curvature, posterior elevation, posterior curvature, total refractive power, and pachymetry) to discriminate between normal and keratoconic eyes with an accuracy of 0.99 and to further classify the keratoconus stage with an accuracy of 0.87. They used the arithmetic mean of the CNN output from each of the six AS-OCT component maps to classify the whole image; thus, their study cannot be directly compared with ours. In the present study, we used a simpler, less computationally intensive CNN architecture of 13 layers, which yielded the highest accuracy across training sessions initialized with different random seeds. We also used different and larger datasets sourced from a more commonly used machine13 for early keratoconus detection.
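As an illustration of this kind of compact, domain-specific network, a minimal sketch of a roughly 13-layer CNN for the three-class map classification is shown below; the filter counts, input size, layer name, and training settings are assumptions for the example, not the exact configuration used in this study.

```python
# A minimal sketch (not the authors' exact architecture): a compact CNN of
# roughly 13 layers that maps a color-coded corneal map to K/N/S probabilities.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_simple_cnn(input_shape=(224, 224, 3), n_classes=3):
    """Small domain-specific CNN: stacked Conv/Pool blocks followed by dense layers."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu", name="last_conv"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),  # keratoconus, normal, subclinical
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A network of this size can be trained directly on downsampled maps without a pretrained backbone, which keeps the computational cost modest.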
We also used a different classification scheme, classifying the whole four-map display in addition to each component map. We showed that when the whole four-map display was used, the CNN yielded accuracy values of 0.98, 0.99, and 0.98 for the K, N, and S classes, respectively, during training/validation; these values remained high (0.989) for the test set, indicating the added value of the four-map composite image compared with the results yielded by each solitary map.
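The two input strategies can be sketched as follows; the array shapes, map ordering, and the `model` object are assumptions for illustration.

```python
# Illustrative sketch only: assembling four equally sized color-coded maps
# (H x W x 3 arrays) into a single 2 x 2 composite image, and, for comparison,
# classifying each component map separately and averaging the softmax outputs.
import numpy as np

def composite_four_maps(maps):
    """Tile four H x W x 3 maps into one 2H x 2W x 3 composite image."""
    top = np.concatenate([maps[0], maps[1]], axis=1)
    bottom = np.concatenate([maps[2], maps[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)

def average_component_predictions(model, maps):
    """Predict each map with the same CNN and take the arithmetic mean of the outputs."""
    probs = [model.predict(m[np.newaxis, ...], verbose=0)[0] for m in maps]
    return np.mean(probs, axis=0)  # arithmetic mean over the component maps
```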
Our ROC analysis results suggest a lower value of anterior elevation maps for keratoconus detection. This result is consistent with that reported by Ishii et al.,41 who suggested a greater diagnostic value of the posterior elevation measurement.
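Such a per-map comparison can be summarized by the area under the ROC curve for each solitary map; the sketch below uses toy labels and scores purely for illustration.

```python
# Illustrative sketch: per-map ROC AUC as a summary of discriminative value.
from sklearn.metrics import roc_auc_score

def per_map_auc(y_true, map_scores):
    """y_true: 1 = keratoconus, 0 = normal; map_scores: map name -> per-eye scores."""
    return {name: roc_auc_score(y_true, scores) for name, scores in map_scores.items()}

# Toy example with made-up scores for four eyes (two keratoconus, two normal):
aucs = per_map_auc(
    y_true=[1, 1, 0, 0],
    map_scores={
        "anterior_elevation": [0.6, 0.4, 0.5, 0.3],
        "posterior_elevation": [0.9, 0.8, 0.2, 0.1],
    },
)
print(aucs)  # in this toy example, the anterior elevation map has the lower AUC
```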
Understanding which image features drive the network's predictions supports the adoption of CNNs as a classification tool in clinical practice.33,37 This offers much-needed confidence in the procedure, permitting physicians to verify predictions made by the network and ensuring that predictions are not influenced by extraneous factors. Qualitative evaluation of model behavior via CAMs provided insight into which image pixels most strongly guided the model's classification decisions. These maps provided compelling evidence that the CNN classification is influenced chiefly by clinically relevant spatial regions, although the influential regions may extend beyond them. These findings are consistent with those reported by Dunnmon et al.42 in their assessment of CNNs for automated classification of chest radiographs; they also noted that, although clinically meaningful spatial regions influence CNN classification, the models occasionally drew on regions beyond those relevant to the classification task, resulting in false-positive or false-negative errors.
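A minimal sketch of how such a heatmap can be computed is shown below; it uses a gradient-weighted formulation of class activation mapping, and the model, layer name, and input preprocessing are assumptions rather than the exact pipeline used in this study.

```python
# Illustrative Grad-CAM-style sketch (assumed Keras model and layer name):
# highlight the map regions that most influence the predicted class.
import numpy as np
import tensorflow as tf

def class_activation_map(model, image, conv_layer_name="last_conv"):
    """Return an H' x W' heatmap in [0, 1] for a single H x W x 3 input image."""
    grad_model = tf.keras.models.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_idx = int(tf.argmax(preds[0]))      # predicted class (K, N, or S)
        class_score = preds[:, class_idx]
    grads = tape.gradient(class_score, conv_out)   # sensitivity of the score to each feature map
    weights = tf.reduce_mean(grads, axis=(1, 2))   # average-pool the gradients per channel
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                       # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```

The resulting heatmap can be upsampled to the input resolution and overlaid on the corneal map for visual inspection.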
Our study has several limitations. The dataset used for both training/validation and testing was sourced from the same institution; thus, generalizing our findings to other institutions should be done with caution, because differences in image quality, data preprocessing, image labeling, sample weights, or other confounding factors could lead to a higher error rate. Another limitation is the ambiguity in defining subclinical keratoconus, forme fruste, and borderline cases, which should represent a corneal tomography spectrum including all patients at high postoperative risk of worsening ectasia.43 The balanced class distribution in this study is far from representing real-life prevalences; it was chosen to prevent model bias during training/validation and to allow class performance to be reported with all available metrics. We also noted a trend toward overfitting of the CNN to the training data, but we did not assess the number of images needed to prevent this during training. Instead of using tens of thousands of images to prevent overfitting, we employed the well-known technique of image augmentation and assessed the trained network on a small dataset with random perturbations. This provided satisfactory model performance while avoiding overfitting. However, our simple but powerful domain-specific CNN architecture offers flexibility in downsampling high-resolution input images while keeping the computational cost low enough for general-purpose central processing units; this would be challenging for standard transfer learning with pretrained CNNs, which usually require high-performance GPUs.
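A minimal augmentation sketch is given below; the perturbation types and parameter values are assumptions for illustration, not the study's exact settings.

```python
# Illustrative sketch: on-the-fly random perturbations of a small image dataset,
# a common way to mitigate overfitting without collecting tens of thousands of images.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,        # small random rotations (degrees)
    width_shift_range=0.05,   # slight horizontal shifts (fraction of width)
    height_shift_range=0.05,  # slight vertical shifts (fraction of height)
    zoom_range=0.05,          # mild random zoom
    fill_mode="nearest",
)

# Typical use with a compiled Keras model (train_x, train_y are assumed arrays):
# model.fit(augmenter.flow(train_x, train_y, batch_size=32), epochs=50)
```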