Changing the RS had a significant impact on model accuracy for GS_RNFL-Map and GS_B-Scan; however, it did not have a significant impact on model AUC. The effect on accuracy is clearest when comparing RS1_RNFL-Map with RS4_RNFL-Map on GS_RNFL-Map, and RS1_B-Scan with RS4_B-Scan on GS_B-Scan. For RNFL map model CNN A, accuracy on GS_RNFL-Map was 80.7% with RS1_RNFL-Map and 80.0% with RS4_RNFL-Map (significant, P = 2.30 × 10^−5, Wilcoxon signed rank test). For b-scan model CNN B, accuracy on GS_B-Scan was 72.4% with RS1_B-Scan and 70.1% with RS4_B-Scan (not significant, P = 0.166, Wilcoxon signed rank test), and for CNN C, accuracy on GS_B-Scan was 74.0% with RS1_B-Scan and 76.4% with RS4_B-Scan (significant, P = 0.002, Wilcoxon signed rank test).
Figure 6 shows ROC curves for RNFL map input and for b-scan input with varying RS. For the RNFL map model, AUC was highest with RS1_RNFL-Map (Hood report) at 0.903 (95% CI, 0.845–0.961), while RS4_RNFL-Map (consensus of experts) resulted in a slightly lower AUC of 0.891 (95% CI, 0.831–0.951) (Fig. 6, left). This difference was not significant (P = 0.790, DeLong's test). The best-performing b-scan model (CNN C) parallels this, with the highest AUC of 0.871 (95% CI, 0.795–0.931) for RS1_B-Scan (cpRNFL reports) and a slightly lower AUC of 0.863 (95% CI, 0.809–0.933) for RS4_B-Scan (consensus of experts) (Fig. 6, right). This difference was not significant (
P = 0.790, DeLong's test). Table 1 and these ROC curves show that CNN performance is highest when the RS used to rate the model training data is the same as the RS used to rate the model testing data (RS1_RNFL-Map/RS1_B-Scan in our case), as indicated by the red rectangles in Table 1. RS4_RNFL-Map and RS4_B-Scan result in a minor reduction in performance for CNN A as well as for CNN B (green rectangles, Table 1), because the consensus RS was used only to rate the model test data and not during model training.
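
To illustrate how a single set of model outputs can be scored against two reference standards, the following minimal Python sketch computes accuracy and AUC for the same scores under two label sets. The scores, labels, the 0.5 decision threshold, and the function names are hypothetical and chosen only for illustration; they are not the study data or the evaluation code used in this work.

```python
from typing import Sequence


def accuracy(scores: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of cases where the thresholded score matches the reference label."""
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive case outscores a negative case.

    Tied scores count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs (probability of disease) for eight test eyes.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]

# Hypothetical labels from two reference standards that disagree on two eyes.
labels_rs1 = [1, 1, 1, 1, 0, 0, 0, 0]
labels_rs4 = [1, 1, 1, 0, 1, 0, 0, 0]

# Score the same outputs against each reference standard.
results = {
    name: (accuracy(scores, labels), auc(scores, labels))
    for name, labels in (("RS1", labels_rs1), ("RS4", labels_rs4))
}

for name, (acc, area) in results.items():
    print(f"{name}: accuracy={acc:.3f}, AUC={area:.3f}")
```

In this toy case, relabeling two of the eight eyes lowers accuracy from 1.0 to 0.75 and AUC from 1.0 to 0.9375, although the model outputs themselves are unchanged. Whether such differences are statistically significant would still need to be assessed with paired tests such as those reported above.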