VF testing suffers from intrasubject variability, which complicates the diagnosis of glaucoma progression.⁴⁵ VF reliability is best assessed with test–retest setups. In the Ocular Hypertension Treatment Study, VF abnormalities were not confirmed in 86% of the original reliable VF exams.⁴⁶ As Guo et al.¹⁴ state, a noisy ground-truth VF makes it harder to assess actual performance improvements in VF modeling from OCT. Lazaridis and colleagues⁴³ reported an R² of 88% between single VF sensitivity values and a median VF sensitivity computed over up to 10 visits within 3 months. Artes et al.³ performed repeat VF testing on 49 glaucomatous eyes and published 5th and 95th percentile limits for VF threshold values. Their confidence intervals provide additional evidence that VF points with lower recorded dB values are more variable than those with higher dB values. Two studies on 24-2 VF estimation from OCT provide a comparison of measured versus predicted VF threshold values, which can be placed next to the empirical 90% CI of Artes et al.³ The recent DL study by Lazaridis et al.⁴³ was not included in the comparison because the boxplots in their Figure 2D used an α level of 0.05.
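To make the notion of empirical 5th and 95th percentile retest limits concrete, the following is a minimal sketch of how such limits could be derived from paired baseline and retest threshold values. It is not the procedure of Artes et al.; the function name `retest_limits`, the grouping by integer baseline dB value, and the example numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np


def retest_limits(baseline_db, retest_db, lower=5, upper=95):
    """Empirical retest limits per baseline sensitivity (hypothetical helper).

    baseline_db: integer threshold values (dB) from the first test.
    retest_db:   threshold values (dB) of the same points on retest.
    Returns {baseline dB: (lower percentile, upper percentile)} of retest values.
    """
    baseline_db = np.asarray(baseline_db)
    retest_db = np.asarray(retest_db, dtype=float)
    limits = {}
    for level in np.unique(baseline_db):
        # All retest values recorded at points with this baseline sensitivity
        values = retest_db[baseline_db == level]
        low, high = np.percentile(values, [lower, upper])
        limits[int(level)] = (float(low), float(high))
    return limits


# Made-up example: retest values spread widely at 10 dB and narrowly at 30 dB
baseline = [10] * 5 + [30] * 5
retest = [0, 5, 10, 15, 20, 28, 29, 30, 31, 32]
for level, (low, high) in retest_limits(baseline, retest).items():
    print(f"{level} dB: 90% retest interval [{low:.1f}, {high:.1f}] dB")
```

In this made-up example the interval at 10 dB is much wider than the one at 30 dB, which mirrors the pattern described above of greater variability at lower sensitivities.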
Figure 3 shows the prediction variability for all three studies (current, Guo et al.,¹⁴ and Zhu et al.¹²). The 90% CI in the current study shows that no systematic overprediction of dB values occurred, with 33 of 38 whiskers (87%) falling within the shaded area. This represents a significant improvement over the previous result of 58% by Guo et al.¹⁴ (P = 0.00256). The interquartile ranges of the boxplots for threshold values below 10 dB appear larger in the current study than in Guo et al.,¹⁴ most likely because of the challenging sample in our study. These results confirm that future performance improvements of the current model will be hard to detect, as almost all predictions fall within the empirically determined VF ranges. In such a context of noisy ground truth, it is preferable to evaluate predictions against the empirically determined CI rather than to focus on MAE.
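The whisker count reported above can be illustrated with a short sketch. It assumes that each boxplot contributes two whisker ends (5th and 95th percentile of the predictions) and that a whisker end counts as "inside" when it lies within the empirical retest limits for the same measured dB value. The function name `whiskers_within_band` and the example values are hypothetical; only the final ratio, 33 of 38, comes from the study.

```python
def whiskers_within_band(whiskers, band):
    """Count whisker ends that fall inside an empirical retest band.

    whiskers: {measured dB: (lower whisker, upper whisker)} of predictions.
    band:     {measured dB: (lower limit, upper limit)} of retest values.
    Returns (number inside, total number of whisker ends).
    """
    inside = 0
    total = 0
    for level, ends in whiskers.items():
        band_low, band_high = band[level]
        for end in ends:
            total += 1
            if band_low <= end <= band_high:
                inside += 1
    return inside, total


# Hypothetical whiskers and retest limits for three measured dB levels
example_whiskers = {0: (0, 12), 10: (2, 20), 20: (15, 27)}
example_band = {0: (0, 10), 10: (1, 22), 20: (14, 26)}
inside, total = whiskers_within_band(example_whiskers, example_band)
print(f"{inside} of {total} whisker ends inside the band")

# Proportion reported in the text: 33 of 38 whiskers
print(f"{33 / 38:.1%}")
```

A count of this kind does not depend on the size of the prediction error, only on whether the spread of predictions stays within the measured test–retest variability, which is why it is a useful complement to MAE when the ground truth is noisy.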